This module is a re-introduction to Excel.
I may trash talk excel đ, but I would argue that it is the most used tool for accountants. Accountants are the original data scientists1 and their weapon of choice has been excel. A large portion of this class will be using R, BUT I hope to show you that R and Excel can work together. At the very least you will learn how to do everything we see in this module in R and you can decide which tool you want to use. đ This case study is provided by PWC. It was developed for a masters level course, but I am adapting it for our class. This should be a Review and I plan to move through this quickly. But, as always, you can slow me down with questions and comments.
Demonstrate knowledge of data quality concepts
Demonstrate knowledge of key terms and capabilities in Excel
Demonstrate how to use Excel to acquire, transform, analyze, and visualize data
Demonstrate knowledge of leading practices for presenting findings in Excel
ButâŚ
Keep it simple and ask the Internet!
\(\color{red}{\text{Term}}\) | \(\color{red}{\text{Definition}}\) |
---|---|
Workbook | An Excel file is referred to as a workbook |
Worksheet | A single âtabâ or âsheetâ within a Workbook |
Cell | The boxes within the worksheet where information is stored. Cells are referenced by column letters (A, B, C, âŚ) and row numbers (1, 2, 3, âŚ) like a map: |
|
|
Range | A contiguous set of cells referenced by the top left and bottom right cells, separated by a colon (:): |
|
Excel formulas can reference different cells
Changes to the referenced cells result in updates to value calculated by the formula
The two example below provide the same result:
=SUM(C1,C2,C3) | Single cells selection- \(\color{red}{\text{Use the Ctrl Key}}\) |
|
---|---|---|
=SUM(C1:C3) | Range selection- \(\color{red}{\text{use Shift Key or Drag with mouse}}\) |
Note: should a row be inserted between C1 and C3, then example 1 will still provide the same results, however example two will extend to add 4 cells.
Absolute referencing is a way of referring to cells within formulas so that once copied, the cell reference remains fixed to a particular cell
Relative referencing As a formula is copied along, the row/column numbers adjust accordingly |
|
Absolute referencing As a formula is copied along, it continues to refer to the same cell as before |
With absolute cell referencing, a dollar sign ($) appears in the cell reference:
But this will vary if the formula is copied across a range of cells \(\color{red}{\text{(relative reference)}}\)
\(\color{red}{\text{(absolute reference)}}\)
Keyboard shortcut F4 cycles through the four absolute cell reference options
Excel calculations are specified with formulas in each cell
To create a formula, type an equals sign (=) followed by the function and required arguments
Order of Operations
Brackets | ( ) |
Exponent | ^ |
Division/Multiplication | / * |
Addition/Subtraction | + - |
The fx button to the left of the formula bar opens a list of all available functions
Functions take zero or more inputs or arguments
=FUNCTION(Input1,Input2,âŚ)
Arguments can be:
Values | 1, 2, 3 or 1.2, 3.6, 2.1 or âxâ, âyâ, âzâ or TRUE, FALSE |
Cell References | B91, A452, C3 |
Ctrl + âĄď¸(⏠ď¸,âŹď¸,âŹď¸) | navigate to the last non-empty cell in that direction |
Ctrl + Shift + âĄď¸(⏠ď¸,âŹď¸,âŹď¸) | select a block of cells to last non-empty cell in that direction |
Ctrl + Space | select the entire column of the cell you have selected |
Shift + Space | select the entire row of the cell you have selected |
Excel files have the extension .xlsx or .xls (older versions)
Excel can also import data from delimited text files
Some comma-separated values files can be opened directly by Excel
Tab-separated values can be copy-and-pasted directly into an Excel worksheet
Delimited text can also be copy-and-pasted into Excel and then separated with text-to-columns
Follow the steps prompted when clicking the icon in the Data tab. You can specify:
Highlight a range and use the Sort option to order records by one or more columns
Make sure to select an entire table to avoid sorting only part of it
Filtering allows you to hide rows in a range if they donât match select criteria
Highlight the range and click the âFilterâ button
Custom filters allow you to use multiple criteria such as AND or OR
COUNT() - Counts numbers
COUNTBLANK() - Counts empty cells
COUNTA() - Counts non-empty cells
MIN() - Returns minimum
MAX() - Returns maximum
AVERAGE() - Returns average/mean
ROWS/COLUMNS() - Return number of rows/columns
Highlighting a range of cells returns summary statistics on the status bar in the lower right corner
This is a quick way to get information about part of your data
Identify which of the six DQ dimensions applies to the issues described below \(\color{red}{\text{(please do not peek at the answers đ)}}\):
âGenderâ field has the special characters like ~!@#$%^*();
Answer
[1] "Validity"
âFirst nameâ is blank or Null
Answer
[1] "Completeness"
âLast nameâ field has only designators such as LLP, LLC, Mr., Mrs., etc
Answer
[1] "Accuracy"
âAddressâ field has only numbers
Answer
[1] "Validity"
âAccount Typeâ field does not have pre-defined list of values
Answer
[1] "Consistency"
âAccount Numberâ field have duplicate values
Answer
[1] "Integrity"
âForex rateâ field does not have up to date exchange rate
Answer
[1] "Timeliness"
Load the data from âpayroll_data.txtâ and âreference_data.txtâ into two tabs in an Excel spreadsheet and save it as âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\) :
How did you import the data?
Answer
How many rows are there?
Answer
[1] "payroll_data = 202 rows including header"
[2] "reference_data = 199 rows including header"
How many columns?
Answer
[1] "payroll_data = 9 columns" "reference_data = 3 columns*"
What is the average salary?
Answer
How many missing values are in each column?
Answer
What potential data quality issues do you find? Which dimensions are they related to?
Answer (one example)
Data entered an Excel worksheet can be represented in different formats, for example:
Dates are stored as the number of days from a fixed historical date
Appends two or more strings
Extracts a specific number of characters starting from a given position
Extracts a specific number of characters from the left of a cell
Extracts a specific number of characters from the right of a cell
Return the number of characters (including spaces) in a cell
Replaces a string with another string
Return the position of a match and error if no match
Rounds a figure to a specified number of digits
Returns the kth largest number
Returns the kth smallest number
Multiplies several values together
Returns Eulerâs number (e) raised to a number
Returns a random number between 0 and 1
Returns a random integer between the specified values
Displays current date
Displays current date and time
Generates a date given day, month, and year
Determines day of date, e.g. 31
Determines month of date, e.g. 12
Determines year of date, e.g. 2001
Returns the week number in the year
Returns the position of the day in a workweek
Returns number of days between two dates
Returns the number of working days between dates
You can apply a formula to an entire column by using the Fill options
You can also Fill Down by double-clicking a cell with a formula in a table
Determine if cell or result of formula is showing #N/A!
Determine if cell or result of formula is an error
Give an alternative result if the formula produces an error
IF() analyses the contents of one or more cells to determine whether a condition is TRUE or FALSE
Given the value of the condition, different values can be returned
Syntax
=IF ( \(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , \(\color{maroon}{\text{value_if_false}}\) )
Examples
=IF ( \(\color{darkred}{\text{B1 > 50}}\) , \(\color{red}{\text{B1 * A1}}\), \(\color{maroon}{\text{B1 * A3}}\))
=IF ( \(\color{darkred}{\text{B1 > B2}}\) , \(\color{red}{\text{B1}}\) , \(\color{maroon}{\text{B2}}\) )
=IF ( \(\color{darkred}{\text{B1 = "awesome"}}\) , \(\color{red}{\text{B1}}\) , \(\color{maroon}{\text{B2}}\) )
Any function can be used with conditional operators in IF statements to test conditions
> Greater than
< Less than
>= Greater than and equal to
<= Less than and equal to
<> Not equal to
= Equal to
Some of functions can be used only on numbers, some to dates and some to strings.
=IF (\(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , IF (\(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , \(\color{maroon}{\text{value_if_false}}\) ))
=IF ( \(\color{darkred}{\text{B1 > 50}}\) , \(\color{red}{\text{B1 * A1}}\), IF ( \(\color{darkred}{\text{B1 > 150}}\) , \(\color{red}{\text{B1 * A2}}\), \(\color{maroon}{\text{B1 * A3}}\)) )
=AND(\(\color{darkred}{\text{logical1}}\), \(\color{red}{\text{logical2}}\), âŚ) Gives True if all the conditions are met
=OR(\(\color{darkred}{\text{logical1}}\), \(\color{red}{\text{logical2}}\), âŚ) Gives True if at least one condition is met
=NOT(\(\color{darkred}{\text{logical}}\)) Gives True if the condition is not met
Always use absolute cell references to specify the range containing the lookup data
You may need to find or replace values across a worksheet or workbook, which would be time-consuming if done manually
Use Find/Replace to do this efficiently:
Ctrl + F: find
Ctrl + H: replace
Excel will search the whole worksheet (or workbook) unless a range is selected
The âOptions >>â button enables the use of requirements such as âMatch caseâ
In âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\) :
Address any data quality issues identified previously.
đ
Create a new column with total compensation (salary plus overtime pay)⌠what is the average total compensation?
Answer
Add a column for the employeeâs tenure at the company in years.
Answer
It turns out that all of the NJ employees work for separate division called âTasty New Jerseyâ. Correct the Division column with this information.
Answer
Create a binary column (0/1) that identifies employees of the âTastyâ operating units.
Answer
Add columns for Job Position and Gender from the âReference Dataâ worksheet.
Answer
Select a random 5% sample of employees to receive a survey.
I had to search the web for this one
A pivot table is an aggregation of a source table
Supports summary statistics:
This summary is presented in a table format which can be formatted and filtered
A table with categories down the rows and across the columns is a cross table
\(\color{red}{\text{Sum of Sales}}\) | \(\color{red}{\text{Month}}\) | |||
---|---|---|---|---|
Salesperson | Jan | Feb | Mar | Grand Total |
Bert | 7508 | 10360 | 6281 | 24149 |
Bill | 5113 | 8916 | 7642 | 21671 |
Fred | 7561 | 9735 | 11221 | 28517 |
Harry | 10513 | 3583 | 9452 | 23548 |
Grand Total | 30695 | 32594 | 34596 | 97885 |
Excel can be used for more advanced analysis than summary statistics and charts
A histogram depicts the frequency or probability distribution of a numeric variable across bins of equal width
Variable distributions may appear statistically ânormalâ or display non-normal characteristics such as skewness or kurtosis
Create a histogram through âData analysisâ by specifying the bins in a separate range
The CORREL() function (and Data Analysis correlation option) quantify the direction and degree of the relationship between two variables
Correlation coefficient is between -1 and 1:
Negative values indicate the variables are inversely related
Positive values indicate the variables are directly related
Values close to zero indicate a weak correlation
Values close to 1 indicate a strong correlation
In addition to descriptive statistics, Excel can build and evaluate simple predictive models:
In âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\):
What is the average percentage of total compensation that is overtime pay, not including staff with no overtime pay?
Answer
Create a pivot table with the following statistics by Operating Unit:
Answer
Add Job Title to the pivot table across the columns
Answer
What is the correlation between total compensation and tenure?
Answer
Create a histogram for total compensation⌠how would you characterize the distribution? What if you transform the data to reduce any skewness?
Answer
Create a linear regression model using tenure to predict total compensation⌠what is the adjusted R-squared value?
Answer
Visualization is often essential for gaining an understanding of the data and presenting findings to a new audience
Excel can easily produce a variety of basic charts
For an effective visualization, always consider:
Weâll focus more on visualization later in the course
Manchester University has low teaching scores compared to its competitors
Sometimes we want to change the data range covered by a chart or add an extra series
When you select a data series on a chart, the source data range can be adjusted
You can also use the Select Data Source dialog box from the Design tab in the ribbon or by right-clicking the chart to add a new series to a graph
In addition to creating a Pivot table on the data, you can also create a Pivot Chart which is based on the Pivot table itself
Changes made to the chart are replicated on the table and vice versa
In âPayroll Data_<NETID>.xlsxâ \(\color{red}{\text{(please do not peek at the answers)đ}}\):
I am not going to lieâŚthese took longer than they should haveâŚ.đ˘
Create a pivot table with average salary by Operating Unit and State and add conditional formatting.
Answer
Create a bar plot with the number of employees by State
Answer
Create a scatter plot of salary and tenure
Answer
Create a line chart to show the cumulative number of employees hired by year.
Answer
Getting an Excel file into a presentable state can be a challenge
Storyboard your workbook and consider how someone would âreadâ through it
Keep it simple!
The following approaches can help with the finishing touches:
To check that formulas have the right cell references, use Trace Precedents or Trace Dependents to display an arrow between linked cells
One quick tip for making a workbook look instantly less cluttered is to remove the gridlines
To ensure that the worksheet prints neatly, you may need to change settings on the Page Layout tab:
Use Print Preview to check how your worksheet will look when printed
Data validation controls what type of data a user can input into a cell, for example:
Turn in your spreadsheet with all of the exercises answered and clearly marked.
\(\color{red}{\text{Task}}\) | \(\color{red}{\text{Description}}\) | \(\color{red}{\text{Excel}}\) |
---|---|---|
Data access | Connect to a data source | â˘File > Open â˘Open in text editor and copy/paste |
Importing data | Read the data into an analytical environment | â˘Text Import Wizard â˘Data > Text to Columns |
Data profiling | Review data dimensions and summary statistics | â˘COUNT() â˘MIN(), MAX(), etc. |
Data quality assessment | Identify aspects of the data that pose challenges for subsequent analysis | â˘Sort â˘Filter â˘COUNTBLANK() |
Data simulation | Generate data based on analytical requirements | â˘RAND() â˘RANDBETWEEN() â˘CHOOSE() |
\(\color{red}{\text{Task}}\) | \(\color{red}{\text{Description}}\) | \(\color{red}{\text{Excel}}\) |
---|---|---|
Cleaning data | Address data quality issues to facilitate analysis | â˘Find/Replace |
Changing data types | Convert a value to the appropriate format for analysis | â˘Format |
Filtering data | Create subsets of records and features based on specified conditions | â˘Filter â˘IF() |
Deriving data | Create new features from original features | â˘MID() â˘FIND() â˘LEN() â˘ROUND() â˘WEEKDAY() â˘âŚ |
Scaling data | Put features with different ranges of values on the same scale while preserving relative values | â˘SUM() â˘AVERAGE() â˘EXP() |
Sampling data | Create subsets of records based on a probability distribution | â˘RAND() â˘RANDBETWEEN() |
Aggregating data | Return a statistic or value for one feature according to different values of another feature | â˘Pivot Table |
Reshaping data | Change whether values are represented in different records or different features | â˘Pivot Table |
Concatenating data | Combine data sets through juxtaposition | â˘Cut and paste |
Merging data | Combine data sets by matching records on a common identifier | â˘VLOOKUP() â˘HLOOKUP() â˘INDEX()/MATCH() |
\(\color{red}{\text{Task}}\) | \(\color{red}{\text{Description}}\) | \(\color{red}{\text{Excel}}\) |
---|---|---|
Summary analysis | Calculate representative statistics for features of interest | â˘AVERAGE() â˘MEDIAN() â˘PERCENTILE.INC() |
Perform statistical tests | Estimate the probability that the data supports a specific claim | â˘Data Analysis Toolpak |
Clustering | Identify similar groups of records | ⢠|
Predictive modeling | Use one set of features to predict the value of another feature | â˘Data Analysis Toolpak |
Network analysis | ⢠|
\(\color{red}{\text{Task}}\) | \(\color{red}{\text{Description}}\) | \(\color{red}{\text{Excel}}\) |
---|---|---|
Data visualization | Display data using lines, shapes, colors, and other abstract representations | â˘Charts |
Dashboarding | Create a collection of dynamic visualizations | â˘Charts |
Exporting data | Produce output from an analytical environment for future use | â˘File > Save As |
Make recommendations | Use results of data analysis to guide decision-making | ⢠|
in my opinion đâŠď¸