This module is a reintroduction to Excel.
I may trash talk excel đ, but I would argue that it is the most used tool for accountants. Accountants are the original data scientists^{1} and their weapon of choice has been excel. A large portion of this class will be using R, BUT I hope to show you that R and Excel can work together. At the very least you will learn how to do everything we see in this module in R and you can decide which tool you want to use. đ This case study is provided by PWC. It was developed for a masters level course, but I am adapting it for our class. This should be a Review and I plan to move through this quickly. But, as always, you can slow me down with questions and comments.
Demonstrate knowledge of data quality concepts
Demonstrate knowledge of key terms and capabilities in Excel
Demonstrate how to use Excel to acquire, transform, analyze, and visualize data
Demonstrate knowledge of leading practices for presenting findings in Excel
ButâŚ
Keep it simple and ask the Internet!
\(\color{red}{\text{Term}}\)  \(\color{red}{\text{Definition}}\) 

Workbook  An Excel file is referred to as a workbook 
Worksheet  A single âtabâ or âsheetâ within a Workbook 
Cell  The boxes within the worksheet where information is stored. Cells are referenced by column letters (A, B, C, âŚ) and row numbers (1, 2, 3, âŚ) like a map: 


Range  A contiguous set of cells referenced by the top left and bottom right cells, separated by a colon (:): 

Excel formulas can reference different cells
Changes to the referenced cells result in updates to value calculated by the formula
The two example below provide the same result:
=SUM(C1,C2,C3)  Single cells selection \(\color{red}{\text{Use the Ctrl Key}}\) 


=SUM(C1:C3)  Range selection \(\color{red}{\text{use Shift Key or Drag with mouse}}\) 
Note: should a row be inserted between C1 and C3, then example 1 will still provide the same results, however example two will extend to add 4 cells.
Absolute referencing is a way of referring to cells within formulas so that once copied, the cell reference remains fixed to a particular cell
Relative referencing As a formula is copied along, the row/column numbers adjust accordingly 

Absolute referencing As a formula is copied along, it continues to refer to the same cell as before 
With absolute cell referencing, a dollar sign ($) appears in the cell reference:
But this will vary if the formula is copied across a range of cells \(\color{red}{\text{(relative reference)}}\)
\(\color{red}{\text{(absolute reference)}}\)
Keyboard shortcut F4 cycles through the four absolute cell reference options
Excel calculations are specified with formulas in each cell
To create a formula, type an equals sign (=) followed by the function and required arguments
Order of Operations
Brackets  ( ) 
Exponent  ^ 
Division/Multiplication  / * 
Addition/Subtraction  +  
The fx button to the left of the formula bar opens a list of all available functions
Functions take zero or more inputs or arguments
=FUNCTION(Input1,Input2,âŚ)
Arguments can be:
Values  1, 2, 3 or 1.2, 3.6, 2.1 or âxâ, âyâ, âzâ or TRUE, FALSE 
Cell References  B91, A452, C3 
Ctrl + âĄď¸(âŹ ď¸,âŹď¸,âŹď¸)  navigate to the last nonempty cell in that direction 
Ctrl + Shift + âĄď¸(âŹ ď¸,âŹď¸,âŹď¸)  select a block of cells to last nonempty cell in that direction 
Ctrl + Space  select the entire column of the cell you have selected 
Shift + Space  select the entire row of the cell you have selected 
Excel files have the extension .xlsx or .xls (older versions)
Excel can also import data from delimited text files
Some commaseparated values files can be opened directly by Excel
Tabseparated values can be copyandpasted directly into an Excel worksheet
Delimited text can also be copyandpasted into Excel and then separated with texttocolumns
Follow the steps prompted when clicking the icon in the Data tab. You can specify:
Highlight a range and use the Sort option to order records by one or more columns
Make sure to select an entire table to avoid sorting only part of it
Filtering allows you to hide rows in a range if they donât match select criteria
Highlight the range and click the âFilterâ button
Custom filters allow you to use multiple criteria such as AND or OR
COUNT()  Counts numbers
COUNTBLANK()  Counts empty cells
COUNTA()  Counts nonempty cells
MIN()  Returns minimum
MAX()  Returns maximum
AVERAGE()  Returns average/mean
ROWS/COLUMNS()  Return number of rows/columns
Highlighting a range of cells returns summary statistics on the status bar in the lower right corner
This is a quick way to get information about part of your data
Identify which of the six DQ dimensions applies to the issues described below \(\color{red}{\text{(please do not peek at the answers đ)}}\):
âGenderâ field has the special characters like ~!@#$%^*();
Answer
[1] "Validity"
âFirst nameâ is blank or Null
Answer
[1] "Completeness"
âLast nameâ field has only designators such as LLP, LLC, Mr., Mrs., etc
Answer
[1] "Accuracy"
âAddressâ field has only numbers
Answer
[1] "Validity"
âAccount Typeâ field does not have predefined list of values
Answer
[1] "Consistency"
âAccount Numberâ field have duplicate values
Answer
[1] "Integrity"
âForex rateâ field does not have up to date exchange rate
Answer
[1] "Timeliness"
Load the data from âpayroll_data.txtâ and âreference_data.txtâ into two tabs in an Excel spreadsheet and save it as âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\) :
How did you import the data?
Answer
How many rows are there?
Answer
[1] "payroll_data = 202 rows including header"
[2] "reference_data = 199 rows including header"
How many columns?
Answer
[1] "payroll_data = 9 columns" "reference_data = 3 columns*"
What is the average salary?
Answer
How many missing values are in each column?
Answer
What potential data quality issues do you find? Which dimensions are they related to?
Answer (one example)
Data entered an Excel worksheet can be represented in different formats, for example:
Dates are stored as the number of days from a fixed historical date
Appends two or more strings
Extracts a specific number of characters starting from a given position
Extracts a specific number of characters from the left of a cell
Extracts a specific number of characters from the right of a cell
Return the number of characters (including spaces) in a cell
Replaces a string with another string
Return the position of a match and error if no match
Rounds a figure to a specified number of digits
Returns the kth largest number
Returns the kth smallest number
Multiplies several values together
Returns Eulerâs number (e) raised to a number
Returns a random number between 0 and 1
Returns a random integer between the specified values
Displays current date
Displays current date and time
Generates a date given day, month, and year
Determines day of date, e.g.Â 31
Determines month of date, e.g.Â 12
Determines year of date, e.g.Â 2001
Returns the week number in the year
Returns the position of the day in a workweek
Returns number of days between two dates
Returns the number of working days between dates
You can apply a formula to an entire column by using the Fill options
You can also Fill Down by doubleclicking a cell with a formula in a table
Determine if cell or result of formula is showing #N/A!
Determine if cell or result of formula is an error
Give an alternative result if the formula produces an error
IF() analyses the contents of one or more cells to determine whether a condition is TRUE or FALSE
Given the value of the condition, different values can be returned
Syntax
=IF ( \(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , \(\color{maroon}{\text{value_if_false}}\) )
Examples
=IF ( \(\color{darkred}{\text{B1 > 50}}\) , \(\color{red}{\text{B1 * A1}}\), \(\color{maroon}{\text{B1 * A3}}\))
=IF ( \(\color{darkred}{\text{B1 > B2}}\) , \(\color{red}{\text{B1}}\) , \(\color{maroon}{\text{B2}}\) )
=IF ( \(\color{darkred}{\text{B1 = "awesome"}}\) , \(\color{red}{\text{B1}}\) , \(\color{maroon}{\text{B2}}\) )
Any function can be used with conditional operators in IF statements to test conditions
> Greater than
< Less than
>= Greater than and equal to
<= Less than and equal to
<> Not equal to
= Equal to
Some of functions can be used only on numbers, some to dates and some to strings.
=IF (\(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , IF (\(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , \(\color{maroon}{\text{value_if_false}}\) ))
=IF ( \(\color{darkred}{\text{B1 > 50}}\) , \(\color{red}{\text{B1 * A1}}\), IF ( \(\color{darkred}{\text{B1 > 150}}\) , \(\color{red}{\text{B1 * A2}}\), \(\color{maroon}{\text{B1 * A3}}\)) )
=AND(\(\color{darkred}{\text{logical1}}\), \(\color{red}{\text{logical2}}\), âŚ) Gives True if all the conditions are met
=OR(\(\color{darkred}{\text{logical1}}\), \(\color{red}{\text{logical2}}\), âŚ) Gives True if at least one condition is met
=NOT(\(\color{darkred}{\text{logical}}\)) Gives True if the condition is not met
Always use absolute cell references to specify the range containing the lookup data
You may need to find or replace values across a worksheet or workbook, which would be timeconsuming if done manually
Use Find/Replace to do this efficiently:
Ctrl + F: find
Ctrl + H: replace
Excel will search the whole worksheet (or workbook) unless a range is selected
The âOptions >>â button enables the use of requirements such as âMatch caseâ
In âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\) :
Address any data quality issues identified previously.
đ
Create a new column with total compensation (salary plus overtime pay)âŚ what is the average total compensation?
Answer
Add a column for the employeeâs tenure at the company in years.
Answer
It turns out that all of the NJ employees work for separate division called âTasty New Jerseyâ. Correct the Division column with this information.
Answer
Create a binary column (0/1) that identifies employees of the âTastyâ operating units.
Answer
Add columns for Job Position and Gender from the âReference Dataâ worksheet.
Answer
Select a random 5% sample of employees to receive a survey.
I had to search the web for this one
A pivot table is an aggregation of a source table
Supports summary statistics:
This summary is presented in a table format which can be formatted and filtered
A table with categories down the rows and across the columns is a cross table
\(\color{red}{\text{Sum of Sales}}\)  \(\color{red}{\text{Month}}\)  

Salesperson  Jan  Feb  Mar  Grand Total 
Bert  7508  10360  6281  24149 
Bill  5113  8916  7642  21671 
Fred  7561  9735  11221  28517 
Harry  10513  3583  9452  23548 
Grand Total  30695  32594  34596  97885 
Excel can be used for more advanced analysis than summary statistics and charts
A histogram depicts the frequency or probability distribution of a numeric variable across bins of equal width
Variable distributions may appear statistically ânormalâ or display nonnormal characteristics such as skewness or kurtosis
Create a histogram through âData analysisâ by specifying the bins in a separate range
The CORREL() function (and Data Analysis correlation option) quantify the direction and degree of the relationship between two variables
Correlation coefficient is between 1 andÂ 1:
Negative values indicate the variables are inversely related
Positive values indicate the variables are directly related
Values close to zero indicate a weak correlation
Values close to 1 indicate a strong correlation
In addition to descriptive statistics, Excel can build and evaluate simple predictive models:
In âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\):
What is the average percentage of total compensation that is overtime pay, not including staff with no overtime pay?
Answer
Create a pivot table with the following statistics by Operating Unit:
Answer
Add Job Title to the pivot table across the columns
Answer
What is the correlation between total compensation and tenure?
Answer
Create a histogram for total compensationâŚ how would you characterize the distribution? What if you transform the data to reduce any skewness?
Answer
Create a linear regression model using tenure to predict total compensationâŚ what is the adjusted Rsquared value?
Answer
Visualization is often essential for gaining an understanding of the data and presenting findings to a new audience
Excel can easily produce a variety of basic charts
For an effective visualization, always consider:
Weâll focus more on visualization later in the course
Manchester University has low teaching scores compared to its competitors
Sometimes we want to change the data range covered by a chart or add an extra series
When you select a data series on a chart, the source data range can be adjusted
You can also use the Select Data Source dialog box from the Design tab in the ribbon or by rightclicking the chart to add a new series to a graph
In addition to creating a Pivot table on the data, you can also create a Pivot Chart which is based on the Pivot table itself
Changes made to the chart are replicated on the table and vice versa
In âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\):
I am not going to lieâŚthese took longer than they should haveâŚ.đ˘
Create a pivot table with average salary by Operating Unit and State and add conditional formatting.
Answer
Create a bar plot with the number of employees by State
Answer
Create a scatter plot of salary and tenure
Answer
Create a line chart to show the cumulative number of employees hired by year.
Answer
Getting an Excel file into a presentable state can be a challenge
Storyboard your workbook and consider how someone would âreadâ through it
Keep it simple!
The following approaches can help with the finishing touches:
To check that formulas have the right cell references, use Trace Precedents or Trace Dependents to display an arrow between linked cells
One quick tip for making a workbook look instantly less cluttered is to remove the gridlines
To ensure that the worksheet prints neatly, you may need to change settings on the Page Layout tab:
Use Print Preview to check how your worksheet will look when printed
Data validation controls what type of data a user can input into a cell, for example:
Turn in your spreadsheet with all of the exercises answered and clearly marked.
\(\color{red}{\text{Task}}\)  \(\color{red}{\text{Description}}\)  \(\color{red}{\text{Excel}}\) 

Data access  Connect to a data source  â˘File > Open â˘Open in text editor and copy/paste 
Importing data  Read the data into an analytical environment  â˘Text Import Wizard â˘Data > Text to Columns 
Data profiling  Review data dimensions and summary statistics  â˘COUNT() â˘MIN(), MAX(), etc. 
Data quality assessment  Identify aspects of the data that pose challenges for subsequent analysis  â˘Sort â˘Filter â˘COUNTBLANK() 
Data simulation  Generate data based on analytical requirements  â˘RAND() â˘RANDBETWEEN() â˘CHOOSE() 
\(\color{red}{\text{Task}}\)  \(\color{red}{\text{Description}}\)  \(\color{red}{\text{Excel}}\) 

Cleaning data  Address data quality issues to facilitate analysis  â˘Find/Replace 
Changing data types  Convert a value to the appropriate format for analysis  â˘Format 
Filtering data  Create subsets of records and features based on specified conditions  â˘Filter â˘IF() 
Deriving data  Create new features from original features  â˘MID() â˘FIND() â˘LEN() â˘ROUND() â˘WEEKDAY() â˘âŚ 
Scaling data  Put features with different ranges of values on the same scale while preserving relative values  â˘SUM() â˘AVERAGE() â˘EXP() 
Sampling data  Create subsets of records based on a probability distribution  â˘RAND() â˘RANDBETWEEN() 
Aggregating data  Return a statistic or value for one feature according to different values of another feature  â˘Pivot Table 
Reshaping data  Change whether values are represented in different records or different features  â˘Pivot Table 
Concatenating data  Combine data sets through juxtaposition  â˘Cut and paste 
Merging data  Combine data sets by matching records on a common identifier  â˘VLOOKUP() â˘HLOOKUP() â˘INDEX()/MATCH() 
\(\color{red}{\text{Task}}\)  \(\color{red}{\text{Description}}\)  \(\color{red}{\text{Excel}}\) 

Summary analysis  Calculate representative statistics for features of interest  â˘AVERAGE() â˘MEDIAN() â˘PERCENTILE.INC() 
Perform statistical tests  Estimate the probability that the data supports a specific claim  â˘Data Analysis Toolpak 
Clustering  Identify similar groups of records  â˘ 
Predictive modeling  Use one set of features to predict the value of another feature  â˘Data Analysis Toolpak 
Network analysis  â˘ 
\(\color{red}{\text{Task}}\)  \(\color{red}{\text{Description}}\)  \(\color{red}{\text{Excel}}\) 

Data visualization  Display data using lines, shapes, colors, and other abstract representations  â˘Charts 
Dashboarding  Create a collection of dynamic visualizations  â˘Charts 
Exporting data  Produce output from an analytical environment for future use  â˘File > Save As 
Make recommendations  Use results of data analysis to guide decisionmaking  â˘ 
in my opinion đâŠď¸