Module 2

This module is a re-introduction to Excel.

Excel

I may trash talk excel 😈, but I would argue that it is the most used tool for accountants. Accountants are the original data scientists1 and their weapon of choice has been excel. A large portion of this class will be using R, BUT I hope to show you that R and Excel can work together. At the very least you will learn how to do everything we see in this module in R and you can decide which tool you want to use. 😏 This case study is provided by PWC. It was developed for a masters level course, but I am adapting it for our class. This should be a Review and I plan to move through this quickly. But, as always, you can slow me down with questions and comments.

Module Objectives

Why excel?

  • We have no choice… everybody uses it! - this is the hard truth
  • Easy-to-use <- debatable 😅
  • Many different functions

But…

Keep it simple and ask the Internet!

Key terms

\(\color{red}{\text{Term}}\) \(\color{red}{\text{Definition}}\)
Workbook An Excel file is referred to as a workbook
Worksheet A single “tab” or “sheet” within a Workbook
Cell The boxes within the worksheet where information is stored. Cells are referenced by column letters (A, B, C, …) and row numbers (1, 2, 3, …) like a map:
  • \(\color{red}{\text{A1}}\)
  • \(\color{red}{\text{BX800}}\)
  • \(\color{red}{\text{EEE20}}\)
Range A contiguous set of cells referenced by the top left and bottom right cells, separated by a colon (:):
  • \(\color{red}{\text{A1:D23}}\)
  • \(\color{red}{\text{BA2:CT8}}\)

Cell References

The two example below provide the same result:

=SUM(C1,C2,C3)

Single cells selection-

\(\color{red}{\text{Use the Ctrl Key}}\)
=SUM(C1:C3)

Range selection-

\(\color{red}{\text{use Shift Key or Drag with mouse}}\)

Note: should a row be inserted between C1 and C3, then example 1 will still provide the same results, however example two will extend to add 4 cells.

Absolute cell references

Absolute referencing is a way of referring to cells within formulas so that once copied, the cell reference remains fixed to a particular cell

Relative referencing

As a formula is copied along, the row/column numbers adjust accordingly

Absolute referencing

As a formula is copied along, it continues to refer to the same cell as before

Absolute cell referencing

With absolute cell referencing, a dollar sign ($) appears in the cell reference:

But this will vary if the formula is copied across a range of cells \(\color{red}{\text{(relative reference)}}\)

\(\color{red}{\text{(absolute reference)}}\)

Keyboard shortcut F4 cycles through the four absolute cell reference options

Formulas

Brackets ( )
Exponent ^
Division/Multiplication / *
Addition/Subtraction + -

Functions

The fx button to the left of the formula bar opens a list of all available functions

Arguments

Functions take zero or more inputs or arguments

=FUNCTION(Input1,Input2,…)

Arguments can be:

Values 1, 2, 3 or 1.2, 3.6, 2.1 or “x”, “y”, “z” or TRUE, FALSE
Cell References B91, A452, C3

Ctrl + ➡️(⬅️,⬆️,⬇️) navigate to the last non-empty cell in that direction
Ctrl + Shift + ➡️(⬅️,⬆️,⬇️) select a block of cells to last non-empty cell in that direction
Ctrl + Space select the entire column of the cell you have selected
Shift + Space select the entire row of the cell you have selected

Acquire data

Getting data into excel

Example - Delimited text

Text to columns

References to worksheets and workbooks

Sample spreadsheet

Sorting

Filters

Custom filters

Custom filters allow you to use multiple criteria such as AND or OR

Summary statistics

COUNT() - Counts numbers

COUNTBLANK() - Counts empty cells

COUNTA() - Counts non-empty cells

MIN() - Returns minimum

MAX() - Returns maximum

AVERAGE() - Returns average/mean

ROWS/COLUMNS() - Return number of rows/columns

Status Bar

Can we rely on the data?

Data quality dimensions

Exercise - Data quality dimensions

Identify which of the six DQ dimensions applies to the issues described below \(\color{red}{\text{(please do not peek at the answers 😈)}}\):

  1. ‘Gender’ field has the special characters like ~!@#$%^*();

    Answer

    [1] "Validity"

  2. ‘First name’ is blank or Null

    Answer

    [1] "Completeness"

  3. ‘Last name’ field has only designators such as LLP, LLC, Mr., Mrs., etc

    Answer

    [1] "Accuracy"

  4. ‘Address’ field has only numbers

    Answer

    [1] "Validity"

  5. ‘Account Type’ field does not have pre-defined list of values

    Answer

    [1] "Consistency"

  6. ‘Account Number’ field have duplicate values

    Answer

    [1] "Integrity"

  7. ‘Forex rate’ field does not have up to date exchange rate

    Answer

    [1] "Timeliness"

Exercise - Excel #1

Load the data from ‘payroll_data.txt’ and ‘reference_data.txt’ into two tabs in an Excel spreadsheet and save it as ‘Payroll Data_<NETID>.xlsx’ \(\color{red}{\text{(please do not peek at the answers)😈}}\) :

  1. How did you import the data?

    Answer

  2. How many rows are there?

    Answer

    [1] "payroll_data = 202 rows including header"  
    [2] "reference_data = 199 rows including header"

  3. How many columns?

    Answer

    [1] "payroll_data = 9 columns"    "reference_data = 3 columns*"

  4. What is the average salary?

    Answer

  5. How many missing values are in each column?

    Answer

  6. What potential data quality issues do you find? Which dimensions are they related to?

    Answer (one example)

Transform Data

Data formats

Data entered an Excel worksheet can be represented in different formats, for example:

Dates are stored as the number of days from a fixed historical date

Text functions

=CONCATENATE (text1, text2, …) or &

Appends two or more strings

=MID(text, start_num, num_chars)

Extracts a specific number of characters starting from a given position

=LEFT(text, num_chars)

Extracts a specific number of characters from the left of a cell

=RIGHT(text, num_chars)

Extracts a specific number of characters from the right of a cell

=LEN(text)

Return the number of characters (including spaces) in a cell

=SUBSTITUTE(text, old_text, new_text)

Replaces a string with another string

=FIND(find_text, within_text, start_position)

Return the position of a match and error if no match

Number functions

=ROUND(number, num_digits)

Rounds a figure to a specified number of digits

=LARGE(array, k)

Returns the kth largest number

=SMALL(array, k)

Returns the kth smallest number

=PRODUCT(number1, number2, …)

Multiplies several values together

=EXP(number)

Returns Euler’s number (e) raised to a number

=RAND()

Returns a random number between 0 and 1

=RANDBETWEEN(bottom, top)

Returns a random integer between the specified values

Date/time functions

=TODAY()

Displays current date

=NOW()

Displays current date and time

=DATE(year, month, day)

Generates a date given day, month, and year

=DAY(serial_number)

Determines day of date, e.g. 31

=MONTH(serial_number)

Determines month of date, e.g. 12

=YEAR(serial_number)

Determines year of date, e.g. 2001

=WEEKNUM(serial_number)

Returns the week number in the year

=WEEKDAY(serial_number)

Returns the position of the day in a workweek

=DAYS(end_date, start_date)

Returns number of days between two dates

=NETWORKDAYS(start_date, end_date)

Returns the number of working days between dates

Fill

Error trapping

=ISNA(value)

Determine if cell or result of formula is showing #N/A!

=ISERROR(value)

Determine if cell or result of formula is an error

=IFERROR(value, value_if_error)

Give an alternative result if the formula produces an error

IF statements

=IF ( \(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , \(\color{maroon}{\text{value_if_false}}\) )

Examples

=IF ( \(\color{darkred}{\text{B1 > 50}}\) , \(\color{red}{\text{B1 * A1}}\), \(\color{maroon}{\text{B1 * A3}}\))

=IF ( \(\color{darkred}{\text{B1 > B2}}\) , \(\color{red}{\text{B1}}\) , \(\color{maroon}{\text{B2}}\) )

=IF ( \(\color{darkred}{\text{B1 = "awesome"}}\) , \(\color{red}{\text{B1}}\) , \(\color{maroon}{\text{B2}}\) )

Conditional operators

Any function can be used with conditional operators in IF statements to test conditions

> Greater than

< Less than

>= Greater than and equal to

<= Less than and equal to

<> Not equal to

= Equal to

NOTE:

Some of functions can be used only on numbers, some to dates and some to strings.

Nested conditional statements

=IF (\(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , IF (\(\color{darkred}{\text{logical_test}}\) , \(\color{red}{\text{value_if_true}}\) , \(\color{maroon}{\text{value_if_false}}\) ))

=IF ( \(\color{darkred}{\text{B1 > 50}}\) , \(\color{red}{\text{B1 * A1}}\), IF ( \(\color{darkred}{\text{B1 > 150}}\) , \(\color{red}{\text{B1 * A2}}\), \(\color{maroon}{\text{B1 * A3}}\)) )

Logical operators

=AND(\(\color{darkred}{\text{logical1}}\), \(\color{red}{\text{logical2}}\), …) Gives True if all the conditions are met

=OR(\(\color{darkred}{\text{logical1}}\), \(\color{red}{\text{logical2}}\), …) Gives True if at least one condition is met

=NOT(\(\color{darkred}{\text{logical}}\)) Gives True if the condition is not met

VLOOKUP

Always use absolute cell references to specify the range containing the lookup data

Unique values and duplicates

Find/Replace

Exercise Excel #2

In ‘Payroll Data_<NETID>.xlsx’ \(\color{red}{\text{(please do not peek at the answers)😈}}\) :

  1. Address any data quality issues identified previously.

    👍

  2. Create a new column with total compensation (salary plus overtime pay)… what is the average total compensation?

    Answer

  3. Add a column for the employee’s tenure at the company in years.

    Answer

  4. It turns out that all of the NJ employees work for separate division called “Tasty New Jersey”. Correct the Division column with this information.

    Answer

  5. Create a binary column (0/1) that identifies employees of the “Tasty” operating units.

    Answer

  6. Add columns for Job Position and Gender from the ‘Reference Data’ worksheet.

    Answer

  7. Select a random 5% sample of employees to receive a survey.

    I had to search the web for this one

Analyze data

Pivot tables

  • A pivot table is an aggregation of a source table

  • Supports summary statistics:

    • Count
    • Sum
    • Min/Max
    • Average

\(\color{red}{\text{Sum of Sales}}\) \(\color{red}{\text{Month}}\)
Salesperson Jan Feb Mar Grand Total
Bert 7508 10360 6281 24149
Bill 5113 8916 7642 21671
Fred 7561 9735 11221 28517
Harry 10513 3583 9452 23548
Grand Total 30695 32594 34596 97885

Customizing pivot tables

Analysis ToolPak

Excel can be used for more advanced analysis than summary statistics and charts

  1. Click the File tab, click Options, and then click the Add-Ins category
  2. In the Manage box, select Excel Add-ins and then click ‘Go’
  3. In the Add-Ins box, check the ‘Analysis ToolPak’ and ‘Solver’ check boxes, and then click ‘OK’
  4. Go to the ‘Data’ tab and look for the Analysis section

Descriptive statistics

Histograms

A histogram depicts the frequency or probability distribution of a numeric variable across bins of equal width

Variable distributions may appear statistically “normal” or display non-normal characteristics such as skewness or kurtosis

Create a histogram through ‘Data analysis’ by specifying the bins in a separate range

Correlation

  • The CORREL() function (and Data Analysis correlation option) quantify the direction and degree of the relationship between two variables

  • Correlation coefficient is between -1 and 1:

    • Negative values indicate the variables are inversely related

    • Positive values indicate the variables are directly related

    • Values close to zero indicate a weak correlation

    • Values close to 1 indicate a strong correlation

Linear regression (univariate)

In addition to descriptive statistics, Excel can build and evaluate simple predictive models:

  1. Select ‘Regression’ from the Data Analysis section
  2. Select your dependent (Y) variable (to be predicted)
  3. Select your independent (X) variable (used to predict the Y variable)
  4. Check boxes for optional outputs (e.g. residuals, line fit plot)

Linear regression output

Exercise #3

In ‘Payroll Data_<NETID>.xlsx’ \(\color{red}{\text{(please do not peek at the answers)😈}}\):

  1. What is the average percentage of total compensation that is overtime pay, not including staff with no overtime pay?

    Answer

  2. Create a pivot table with the following statistics by Operating Unit:

    • Number of employees
    • Average total compensation
    • Average tenure (in years)

    Answer

  3. Add Job Title to the pivot table across the columns

    Answer

  4. What is the correlation between total compensation and tenure?

    Answer

  5. Create a histogram for total compensation… how would you characterize the distribution? What if you transform the data to reduce any skewness?

    Answer

  6. Create a linear regression model using tenure to predict total compensation… what is the adjusted R-squared value?

    Answer

Present findings

Visualization in excel

Conditional formatting

  • Conditional formatting adjusts the color of a cell according to the relative magnitude of the values or established rules
  • It is commonly used for RAG (Red/Amber/Green) reports, in which good figures are highlighted in green and bad ones in red, but any gradient can be applied

What do you think of this visualization?

What about this one?

Manchester University has low teaching scores compared to its competitors

Effective visualization

Basic charts

Insert chart

  • The first step to setting up your chart is to arrange the data into a table
  • Then select the table and select the type of graph you are looking for on the Insert tab

Axis settings

Data labels

Gridlines

Titles

Data in rows or columns

Changing data range

Sometimes we want to change the data range covered by a chart or add an extra series

When you select a data series on a chart, the source data range can be adjusted

Select data source

You can also use the Select Data Source dialog box from the Design tab in the ribbon or by right-clicking the chart to add a new series to a graph

Pivot charts

In addition to creating a Pivot table on the data, you can also create a Pivot Chart which is based on the Pivot table itself

Changes made to the chart are replicated on the table and vice versa

Exercise #4

In ‘Payroll Data_<NETID>.xlsx’ \(\color{red}{\text{(please do not peek at the answers)😈}}\):

I am not going to lie…these took longer than they should have….😢

  1. Create a pivot table with average salary by Operating Unit and State and add conditional formatting.

    Answer

  2. Create a bar plot with the number of employees by State

    Answer

  3. Create a scatter plot of salary and tenure

    Answer

  4. Create a line chart to show the cumulative number of employees hired by year.

    Answer

Preparing the final workbook

Formula auditing

To check that formulas have the right cell references, use Trace Precedents or Trace Dependents to display an arrow between linked cells

Removing gridlines

One quick tip for making a workbook look instantly less cluttered is to remove the gridlines

Freeze panes

Page layout

To ensure that the worksheet prints neatly, you may need to change settings on the Page Layout tab:

Use Print Preview to check how your worksheet will look when printed

Data validation

Data validation controls what type of data a user can input into a cell, for example:

  • Only a number within a certain range
  • Only a time/date within a certain period
  • Only an item from a predefined list

More leading practices

Assignment

Turn in your spreadsheet with all of the exercises answered and clearly marked.

Summary

Acquire data

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{Excel}}\)
Data access Connect to a data source

•File > Open

•Open in text editor and copy/paste

Importing data Read the data into an analytical environment

•Text Import Wizard

•Data > Text to Columns

Data profiling Review data dimensions and summary statistics

•COUNT()

•MIN(), MAX(), etc.

Data quality assessment Identify aspects of the data that pose challenges for subsequent analysis

•Sort

•Filter

•COUNTBLANK()

Data simulation Generate data based on analytical requirements

•RAND()

•RANDBETWEEN()

•CHOOSE()

Transform data

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{Excel}}\)
Cleaning data Address data quality issues to facilitate analysis •Find/Replace
Changing data types Convert a value to the appropriate format for analysis •Format
Filtering data Create subsets of records and features based on specified conditions

•Filter

•IF()

Deriving data Create new features from original features

•MID()

•FIND()

•LEN()

•ROUND()

•WEEKDAY()

•…

Scaling data Put features with different ranges of values on the same scale while preserving relative values

•SUM()

•AVERAGE()

•EXP()

Sampling data Create subsets of records based on a probability distribution

•RAND()

•RANDBETWEEN()

Aggregating data Return a statistic or value for one feature according to different values of another feature •Pivot Table
Reshaping data Change whether values are represented in different records or different features •Pivot Table
Concatenating data Combine data sets through juxtaposition •Cut and paste
Merging data Combine data sets by matching records on a common identifier

•VLOOKUP()

•HLOOKUP()

•INDEX()/MATCH()

Analyze data

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{Excel}}\)
Summary analysis Calculate representative statistics for features of interest

•AVERAGE()

•MEDIAN()

•PERCENTILE.INC()

Perform statistical tests Estimate the probability that the data supports a specific claim •Data Analysis Toolpak
Clustering Identify similar groups of records •
Predictive modeling Use one set of features to predict the value of another feature •Data Analysis Toolpak
Network analysis •

Present findings

\(\color{red}{\text{Task}}\) \(\color{red}{\text{Description}}\) \(\color{red}{\text{Excel}}\)
Data visualization Display data using lines, shapes, colors, and other abstract representations •Charts
Dashboarding Create a collection of dynamic visualizations •Charts
Exporting data Produce output from an analytical environment for future use •File > Save As
Make recommendations Use results of data analysis to guide decision-making •

  1. in my opinion 😄↩︎