The extra info for module 4 is online.
Filtering joins match observations in the same way as mutating joins, but affect the observations, not the variables. There are two types:
semi_join(x, y)
keeps all observations in x
that have a match in y
.
anti_join(x, y)
drops all observations in x
that have a match in y
.
Semi-joins are useful for matching filtered summary tables back to the original rows.
Graphically, a semi-join looks like this:
Only the existence of a match is important; it doesn’t matter which observation is matched. This means that filtering joins never duplicate rows like mutating joins do:
The inverse of a semi-join is an anti-join. An anti-join keeps the rows that don’t have a match:
Anti-joins are useful for diagnosing join mismatches.
The final type of two-table verb are the set operations. Generally, I use these the least frequently, but they are occasionally useful when you want to break a single complex filter into simpler pieces. All these operations work with a complete row, comparing the values of every variable. These expect the x
and y
inputs to have the same variables, and treat the observations like sets:
intersect(x, y)
: return only observations in both x
and y
.
union(x, y)
: return unique observations in x
and y
.
setdiff(x, y)
: return observations in x
, but not in y
.
Given this simple data:
The four possibilities are:
intersect(df1, df2)
# A tibble: 1 x 2
x y
<dbl> <dbl>
1 1 1
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
# Note that we get 3 rows, not 4
union(df1, df2)
# A tibble: 3 x 2
x y
<dbl> <dbl>
1 1 1
2 2 1
3 1 2
#> # A tibble: 3 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 1
#> 2 2 1
#> 3 1 2
setdiff(df1, df2)
# A tibble: 1 x 2
x y
<dbl> <dbl>
1 2 1
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 2 1
setdiff(df2, df1)
# A tibble: 1 x 2
x y
<dbl> <dbl>
1 1 2
#> # A tibble: 1 x 2
#> x y
#> <dbl> <dbl>
#> 1 1 2
This is an important property of what’s known as normal forms of data. The process of decomposing data frames into less redundant tables without losing information is called normalization. More information is available on Wikipedia.
Both dplyr
and SQL we mentioned in the introduction of this chapter use such normal forms. Given that they share such commonalities, once you learn either of these two tools, you can learn the other very easily.
Another useful function is rename()
, which as you may have guessed changes the name of variables.
We can also return the top n
values of a variable using the top_n()
function.