Module 1

Jump into Relational databases. A quick re-introduction to db browser and relational databases

Game Plan

At this point everyone should have the class technology up and running on their machines. We are going to learn the basics ETL using mostly RStudio, but we will start off with concepts learned in ACC3003, relational databases and excel. You should have experience with relational databases and excel, but if you have forgotten do not worry. We will pick up here and discover how R can push these technologies further. Lets get started with Relational databases.

This case study is provided by PWC. It was developed for a masters level course, but I am adapting it for our class. I will provide all of the knowledge based content because it is important to get an understanding of what accounting firms are looking for, but we may only discuss portions in class.

Lets Get Started

Module 1 Objectives

Review

Data Analytics Process

What is a Relational Database

A relational database is a collection of data that organizes data in tables and maintains relationships between the tables for information retrieval

A relational database management system (RDBMS) is computer software that enables users to create, modify, and analyze data in a relational database

Enterprise RDBMS include:

Lightweight RDBMS include:

Key Terms

Tables

Relational Tables

Primary Key

Relationship Types

One-to-One

One-to-Many

Many-to-Many

Entity Relationship Diagram

Other Information in PWC Case Study

Structured Query Language

What is SQL

SQL Basics

SQLite and DB Browser for SQLite

www.sqlite.org:

sqlitebrowser.org:

SQLite/DB4S Basics

Comments

Chinook Sample Database

SYNTAX

SELECT statement

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT * FROM}}\) table_name

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{LIMIT}}\) 5

CREATE TABLE statement

\(\color{red}{\text{CREATE TABLE}}\) new_table \(\color{red}{\text{AS}}\)

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

ORDER BY

The \(\color{red}{\text{ORDER BY}}\) keyword can be added to a SELECT statement to sort the records

The typical default order is ascending (ASC), but you can also specify descending (DESC)

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{ORDER BY}}\) column2, column1, column3 \(\color{red}{\text{DESC}}\)

COUNT

\(\color{red}{\text{SELECT}}\) COUNT(*) \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT}}\) COUNT(1) \(\color{red}{\text{FROM}}\) table_name

Exercise #1

Answer the following questions using the Chinook Database in DB Browser \(\color{red}{\text{(Do not peek at the answers 👿 )}}\):

  1. How many columns does the customer table have?1

Answer 1

Answer 2

  1. If you select only the first 5 rows, what is the country in the last row?

Answer 1

Answer 2

  1. If you sort the table on last name, what is the last name in the third row?

Answer 1

Answer 2

  1. How many rows are in the invoice table?

Answer 1

Answer 2

WHERE clause

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{WHERE}}\) column1 = ‘ABC’

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{WHERE}}\) column2 > 100

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{WHERE}}\) column3 != 0

Comparative Operators

Comparative Operators
\(\color{red}{\text{Operator}}\) \(\color{red}{\text{Operator Description}}\) \(\color{red}{\text{Example (a=10, b=20)}}\)
= Checks if the values of two operands are equal or not, if yes then condition becomes true (a= b) is not true
!= Checks if the values of two operands are equal or not, if values are not equal then condition becomes true (a != b) is true
<> Checks if the value of two operands are equal or not, if values are not equal then condition becomes true (a <> b) is true
> Checks if the value of left operand is greater than the value of right operand, if yes then condition becomes true (a > b) is not true
< Checks if the value of left operand is less than the value of right operand, if yes then condition becomes true (a < b) is true

AND/OR

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) a_condition \(\color{red}{\text{AND}}\) another_condition

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) a_condition \(\color{red}{\text{OR}}\) another_condition

IN

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) column1 \(\color{red}{\text{IN}}\) (value1, value2, …, valueN)

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) column1 = value1 \(\color{red}{\text{OR}}\) column1 = value \(\color{red}{\text{OR}}\)\(\color{red}{\text{OR}}\) column1 = valueN

BETWEEN

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) column2 \(\color{red}{\text{BETWEEN}}\) value1 \(\color{red}{\text{AND}}\) value2

LIKE

\(\color{red}{\text{SELECT}}\) column1, column2, column3 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) column2 \(\color{red}{\text{LIKE}}\)\(\color{red}{\text{%}}\)value1\(\color{red}{\text{%}}\)

\(\color{red}{\text{Wildcard}}\) \(\color{red}{\text{Description}}\)
% A substitute for zero or more characters
_ (underscore) A substitute for a single character
[charlist] Sets and ranges of characters to match
[^charlist] Matches only a character NOT specified within the brackets

NULL and NOT

\(\color{red}{\text{SELECT}}\) column1, column2 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) column2 \(\color{red}{\text{IS NULL}}\)

\(\color{red}{\text{SELECT}}\) column1, column2 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) column2 \(\color{red}{\text{IS NOT NULL}}\)

Exercise #2

Answer the following questions using the Chinook Database in DB Browser \(\color{red}{\text{(Do not peek at the answers 👿 )}}\): 2

  1. How many invoices have totals greater than $20?

    Answer 1

    Answer 2

  2. How many invoices have totals greater than $10 and less than $20?

    Answer 1

    Answer 2

  3. How many invoices are from customers in Brazil, Argentina, or Chile?

    Answer 1

    Answer 2

  4. How many invoices have a billing country that starts with “C”?

    Answer 1

    Answer 2

  5. How many customers list their company?

    Answer 1

    Answer 2

Expressions

SELECT column1, end_bal, end_bal – beg_bal \(\color{red}{\text{AS}}\) “difference”

FROM table_name

Arithmetic operators

\(\color{red}{\text{Operator}}\) \(\color{red}{\text{Description}}\)
+ Adds values on either side of the operator
- Subtracts right hand operand from left hand operand
* Multiplies values on either side of the operator
/ Divides left hand operand by right hand operand
% Divides left hand operand by right hand operand and returns remainder (“modulus”)

String functions

\(\color{red}{\text{RTRIM(<string>)}}\)

\(\color{red}{\text{LTRIM(<string>)}}\)

\(\color{red}{\text{SUBSTR(<string>,<start location>,<length>)}}\)

\(\color{red}{\text{LENGTH(<string>)}}\)

String functions (continued)

\(\color{red}{\text{UPPER(<string>)}}\)

\(\color{red}{\text{LOWER(<string>)}}\)

\(\color{red}{\text{REPLACE(<string value>,<string to replace>,<replacement>) }}\)

\(\color{red}{\text{SELECT}}\) column1, \(\color{red}{\text{REPLACE}}\) (column1, ‘/’, ‘-’) \(\color{red}{\text{AS}}\) new_col

\(\color{red}{\text{FROM}}\) table_name

Numeric functions

\(\color{red}{\text{SELECT}}\) column1, \(\color{red}{\text{ROUND}}\) (column1, 2) \(\color{red}{\text{AS}}\) new_col

\(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT}}\) *, \(\color{red}{\text{ABS}}\)(\(\color{red}{\text{RANDOM}}\)() % 100) \(\color{red}{\text{AS}}\) new_col

\(\color{red}{\text{FROM}}\) table_name

Data functions

  • Different SQL implementations have varying functions and operators for dates and times

  • In SQLite, date objects are created by DATE and related functions, and arithmetic operators are used to do calculations involving dates

  • DATE(‘now’) returns the current date and time of the server

  • JULIANDAY interprets dates in the standard YYYY-MM-DD format

  • STRFTIME provides flexible options for managing dates and times using substitutions shown at right

\(\color{red}{\text{Sub}}\) \(\color{red}{\text{Description}}\)
%d day of month: 00
%f fractional seconds: SS.SSS
%H hour: 00-24
%j day of year: 001-366
%J Julian day number
%m month: 01-12
%M minute: 00-59
%s seconds since 1970-01-01
%S seconds: 00-59
%w day of week 0-6 with Sunday = 0
%W week of year: 00-53
%Y year: 0000-9999

Example - Date functions

\(\color{red}{\text{SELECT JULIANDAY}}\)(‘now’) - \(\color{red}{\text{JULIANDAY}}\)(‘1776-07-04’);

\(\color{red}{\text{SELECT}}\) column1, \(\color{red}{\text{STRFTIME}}\)(‘%Y-%m-%d’, column1) \(\color{red}{\text{AS}}\) new_col

\(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT}}\) column1, \(\color{red}{\text{STRFTIME}}\)(‘%w’, column1) \(\color{red}{\text{AS}}\) new_col

\(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT}}\) column1, \(\color{red}{\text{STRFTIME}}\)(‘%W’, column1) \(\color{red}{\text{AS}}\) new_col

\(\color{red}{\text{FROM}}\) table_name

CAST

\(\color{red}{\text{SELECT}}\) column1, \(\color{red}{\text{CAST}}\)(column1 \(\color{red}{\text{AS INT}}\)) \(\color{red}{\text{AS}}\) int_col \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT}}\) column2, \(\color{red}{\text{CAST}}\)(column2 \(\color{red}{\text{AS CHAR}}\)) \(\color{red}{\text{AS}}\) char_col \(\color{red}{\text{FROM}}\) table_name

Exercise #3

Answer the following questions using the Chinook Database in DB Browser \(\color{red}{\text{(Do not peek at the answers 👿 )}}\):

  1. How many employees does Chinook have?

    Answer 1

    Answer 2

  2. Create a new column that shows their phone numbers without the area code.

    Answer 1

    Answer 2

  3. How long is the longest last name?

    Answer 1

    Answer 2

  4. How many employees were born on a Sunday?

    Answer 1

    Answer 2

  5. How old was the oldest employee when they were hired?

    Answer 1

    Answer 2

DISTINCT

\(\color{red}{\text{SELECT DISTINCT}}\) column1 \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT COUNT}}\)(\(\color{red}{\text{DISTINCT}}\) column1) \(\color{red}{\text{FROM}}\) table_name

Aggregate functions

\(\color{red}{\text{Function}}\) \(\color{red}{\text{Description}}\)
COUNT() Counts the total number of records
SUM() Sum of numeric values
AVG() Finds the average
MIN() Returns the smallest number
MAX() Returns the largest number

GROUP BY clause

\(\color{red}{\text{SELECT}}\) column1, column2, count(1) as ct, sum(column1) as sum1

\(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{WHERE}}\) a_condition

\(\color{red}{\text{GROUP BY}}\) column1, column2

\(\color{red}{\text{ORDER BY}}\) column2, column1

Example - GROUP BY with aggregate function

Having clause

\(\color{red}{\text{SELECT}}\) column1, column2, count(1) as counts \(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{GROUP BY}}\) column1, column2 \(\color{red}{\text{HAVING}}\) counts > 5

\(\color{red}{\text{SELECT}}\) column1, count(1) as counts, sum(column2) as sum2

\(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{GROUP BY}}\) column1

\(\color{red}{\text{HAVING}}\) counts > 1 and sum2 > 10000

CASE

\(\color{red}{\text{SELECT}}\) column1, column2,

\(\color{red}{\text{CASE}}\) <test expression>

\(\color{red}{\text{WHEN}}\) <comparison expression1> \(\color{red}{\text{THEN}}\) <return value1>

\(\color{red}{\text{WHEN}}\) <comparison expression1> \(\color{red}{\text{THEN}}\) <return value1>

\(\color{red}{\text{ELSE}}\) <default value>

\(\color{red}{\text{END AS}}\) new_field_name

\(\color{red}{\text{FROM}}\) table_name

\(\color{red}{\text{SELECT}}\) column1, \(\color{red}{\text{SUM}}\)(

\(\color{red}{\text{CASE}}\) <test expression>

\(\color{red}{\text{WHEN}}\) <comparison expression1> \(\color{red}{\text{THEN 1 ELSE 0}}\)

\(\color{red}{\text{END AS}}\) new_field_name

) \(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{GROUP BY}}\) column1

Exercise #4

Answer the following questions using the Chinook Database in DB Browser \(\color{red}{\text{(Do not peek at the answers 👿 )}}\):

  1. Return a list of countries by the total value of invoices in descending order.

    Answer 1

  2. How many invoice line items are there?

    Answer 1

  3. How many distinct prices are there?

    Answer 1

  4. How many downloads of each price have happened?

    Answer 1

  5. Which tracks have been downloaded the most times?

    Answer 1

  6. What is the range of invoice dates covered by the database?

    Answer 1

    Answer 2

JOIN clause

A JOIN combines fields from two tables using a key column shared by the tables

SELECT table_a.column1, table_a.column2, table_b.column3

FROM table_a \(\color{red}{\text{JOIN}}\) table_b ON \(\color{green}{\text{table_a.column1}}\) = \(\color{green}{\text{table_b.column1}}\)

\(\color{GREEN}{\text{*KEY*}}\)

Aliases

SELECT Employee.BusinessEntityID, FirstName, LastName, JobTitle, BirthDate FROM HumanResources.Employee INNER JOIN Person.Person ON \(\color{orange}{\text{Employee.}}\)BusinessEntityID = \(\color{orange}{\text{Person.}}\)BusinessEntityID

SELECT e.BusinessEntityID, FirstName, LastName, JobTitle, BirthDate FROM HumanResources.Employee \(\color{green}{\text{AS e}}\) INNER JOIN Person.Person \(\color{green}{\text{AS p}}\) ON \(\color{green}{\text{e.}}\)BusinessEntityID = \(\color{green}{\text{p.}}\)BusinessEntityID

Join types

INNER join

LEFT join

RIGHT join

FULL OUTER join

Example - Join with GROUP BY

Combining queries

\(\color{red}{\text{SELECT}}\) * \(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{WHERE}}\) column1 >

(\(\color{red}{\text{SELECT AVG}}\)(column1) \(\color{red}{\text{FROM}}\) table_name);

More complex queries with multiple layers of nesting may require the use of aliases

\(\color{red}{\text{SELECT COUNT}}\)(*) \(\color{red}{\text{FROM}}\) (\(\color{red}{\text{SELECT}}\) * \(\color{red}{\text{FROM}}\) table_name \(\color{red}{\text{WHERE}}\) column1 >

(\(\color{red}{\text{SELECT AVG}}\)(column1) \(\color{red}{\text{FROM}}\) table_name)) \(\color{red}{\text{temp}}\);

Exercise #5

Answer the following questions using the Chinook Database in DB Browser \(\color{red}{\text{(Do not peek at the answers 👿 )}}\):

  1. Return the top five most downloaded tracks with the full name of the track and number of downloads.

    Answer 1

  2. How many distinct albums have actually been downloaded?

    Answer 1A

    Answer 1B

    Answer 1C

  3. What is the most popular genre?

    Answer 1

Summary of Analytics Workflow

Acquire data

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{SQL}}\)
Data access Connect to a data source
Importing data Read the data into an analytical environment New Database > File > Import > Table from CSV File
Data profiling Review data dimensions and summary statistics

COUNT

DISTINCT

MIN, MAX, SUM, AVG, etc.

ORDER BY

Data quality assessment Identify aspects of the data that pose challenges for subsequent analysis

IS NULL

IS NOT NULL

Data simulation Generate data based on analytical requirements RANDOM

Transform data

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{SQL}}\)
Cleaning data Address data quality issues to facilitate analysis

LTRIM, RTRIM, UPPER, LOWER, etc.

ROUND

Changing data types Convert a value to the appropriate format for analysis CAST AS
Filtering data Create subsets of records and features based on specified conditions

WHERE

IN, BETWEEN, LIKE

HAVING

Deriving data Create new features from original features

AS

REPLACE, SUBSTR, etc.

JULIANDAY, etc.

CASE, IF

Scaling data Put features with different ranges of values on the same scale while preserving relative values MIN, MAX, SUM, AVG, etc.

Transform data (continued)

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{SQL}}\)
Sampling data Create subsets of records based on a probability distribution RANDOM
Aggregating data Return a statistic or value for one feature according to different values of another feature

GROUP BY

MIN, MAX, SUM, AVG, etc.

Reshaping data Change whether values are represented in different records or different features
Concatenating data Combine data sets through juxtaposition

UNION

UNION ALL

Merging data Combine data sets by matching records on a common identifier

INNER JOIN

LEFT JOIN

Analyze data

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{SQL}}\)
Summary analysis Calculate representative statistics for features of interest MIN, MAX, SUM, AVG, etc.
Perform statistical tests Estimate the probability that the data supports a specific claim
Clustering Identify similar groups of records
Predictive modeling Use one set of features to predict the value of another feature
Network analysis Examine relationships between entities

Present findings

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{SQL}}\)
Data visualization Display data using lines, shapes, colors, and other abstract representations
Dashboarding Create a collection of dynamic visualizations
Exporting data Produce output from an analytical environment for future use

Copy/paste

File > Export > Table(s) as CSV file…

Make recommendations Use results of data analysis to guide decision-making

ASSIGNMENT

I would like you to turn in a word document with full screen shots of each of the exercises.

SEE CANVAS FOR ASSIGNMENT DETAILS

Additional Resources

Tutorial for SQLite


  1. I did not find a great SQL solution. If you find a better one please let me know.↩︎

  2. These are not the only answers↩︎