Last time we were working on joins. A helpful tool for understanding joining syntax/semantics (whether in R or otherwise) is by some quick visualization…
Let’s paint some pictures using really simple
example in a hand-crafted tibble. In the below we’ll use a single key
called key and a single value column (val_x
and val_y), but the ideas all generalize to arbitrary
numbers of keys and values.
x <- tribble(
~key, ~val_x,
1, "x1",
2, "x2",
3, "x3"
)
y <- tribble(
~key, ~val_y,
1, "y1",
2, "y2",
4, "y3"
)
To understand how joins work, it’s useful to think of every possible match. Here that’s shown with a grid of connecting lines
To
describe a specific type of join, we indicate matches with dots. The
matches determine the rows in the output, a new data frame that contains
the key, the
x values, and the y values.
For instance, below we show an innter join, where rows are retained only if and only if the keys are equal
We can apply the same principles to explain the outer
joins, which keep observations that appear in at least one of
the data frames. These joins work by adding an additional “virtual”
observation to each data frame. This observation has a key that matches
if no other key matches, and values filled with NA. There
are three types of outer joins:
x: Every row of x is preserved in the output
because it can fall back to matching a row of NAs in
y. y: Every row of y is preserved in the output
because it can fall back to matching a row of NAs in
x. The output still matches x as much as
possible; any extra rows from y are added to the end. x or y: Every row of x and
yis included in the output because bothxand y
have a fall back row of NAs. Again, the output starts with
all rows from x, followed by the remaining unmatched
y rows. Let’s say we have the following snippet of data
participants <- tibble(
id = c("P1", "P2", "P3", "P4"),
age = c(22, 24, 20, 23)
)
sessions <- tibble(
id = c("P2", "P3", "P5"),
condition = c("A", "B", "A")
)
participants
## # A tibble: 4 × 2
## id age
## <chr> <dbl>
## 1 P1 22
## 2 P2 24
## 3 P3 20
## 4 P4 23
sessions
## # A tibble: 3 × 2
## id condition
## <chr> <chr>
## 1 P2 A
## 2 P3 B
## 3 P5 A
Let’s try to do that together!
Q: See if you can figure out what will happen if you run the
following code:
inner_join(participants, sessions, by = "id")
(Try to work it out before actually running it)
Q: How might that differ if you ran
this instead?
left_join(participants, sessions, by = "id")
Getting your data into the right format for the analysis you’d like to run can be a task all on its own.
Thankfully there are a few very common transformations that cover the vast majority of cases. So the goal of this part is to get you familiar with the main operation you’ll want to master: converting between wide and long representations of data.
You can represent the same underlying data in multiple ways.
The example below shows the same data organized in three different
ways. Each dataset shows the same values of four variables: -
country, - year, - population, -
tuberculosis (i.e. the number of documented cases of
TB)
But each dataset organizes the values in a different way.
table1
## # A tibble: 6 × 4
## country year cases population
## <chr> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
table2
## # A tibble: 12 × 4
## country year type count
## <chr> <dbl> <chr> <dbl>
## 1 Afghanistan 1999 cases 745
## 2 Afghanistan 1999 population 19987071
## 3 Afghanistan 2000 cases 2666
## 4 Afghanistan 2000 population 20595360
## 5 Brazil 1999 cases 37737
## 6 Brazil 1999 population 172006362
## 7 Brazil 2000 cases 80488
## 8 Brazil 2000 population 174504898
## 9 China 1999 cases 212258
## 10 China 1999 population 1272915272
## 11 China 2000 cases 213766
## 12 China 2000 population 1280428583
table3
## # A tibble: 6 × 3
## country year rate
## <chr> <dbl> <chr>
## 1 Afghanistan 1999 745/19987071
## 2 Afghanistan 2000 2666/20595360
## 3 Brazil 1999 37737/172006362
## 4 Brazil 2000 80488/174504898
## 5 China 1999 212258/1272915272
## 6 China 2000 213766/1280428583
Take a look at each of table{1,2,3}. Try and describe what the three formats of the different data are?
What is the mapping between them conceptually.
Just one of them, table1, will be much easier to work
with for our purposes because it’s what we call
tidy.
(We actually say this a couple weeks ago but it’s worth reiterating)
There are three interrelated rules that make a dataset tidy:
Why ensure that your data is tidy? There are two main advantages:
dplyr, ggplot2, and other useful packages
are designed to work with tidy data. Here are a few
small examples showing how you might work with table1.
From table1 it’s relatively straightforward to derive
some things:
# Compute rate per 10,000
table1 %>%
mutate(rate = cases / population * 10000)
## # A tibble: 6 × 5
## country year cases population rate
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1999 745 19987071 0.373
## 2 Afghanistan 2000 2666 20595360 1.29
## 3 Brazil 1999 37737 172006362 2.19
## 4 Brazil 2000 80488 174504898 4.61
## 5 China 1999 212258 1272915272 1.67
## 6 China 2000 213766 1280428583 1.67
# Or compute total cases per year
table1 %>%
group_by(year) %>%
summarize(total_cases = sum(cases))
## # A tibble: 2 × 2
## year total_cases
## <dbl> <dbl>
## 1 1999 250740
## 2 2000 296920
# We can easily visualize this change over time
ggplot(table1, aes(x = year, y = cases)) +
geom_line(aes(group = country), color = "grey50", linewidth= 2 ) +
geom_point(aes(color = country, shape = country, size = 6)) +
scale_x_continuous(breaks = c(1999, 2000)) + # x-axis breaks at 1999 and 2000
theme_bw()
Q: See if you can perform any of
the same operations from table2
How would you calculate rate per 10,000 here?
Lots of data you’ll encounter is not like
table1
Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry or saving easy, rather than analysis.
This is likely what happens if you’ve run an online experiment (usign PCIbex) or similar.
This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult some documentation, or the the people who originally generated the data.
Next, you’ll pivot your data into a tidy form, with variables in the columns and observations in the rows.
tidyr provides two functions for pivoting data: 1.
pivot_longer() 2. pivot_wider()
We’ll describe each in turn starting with
pivot_longer()
The billboard dataset records the billboard rank of
songs in the year 2000:
head(billboard)
## # A tibble: 6 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors Do… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors Do… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## # wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …
In this dataset, each observation is a song. The first three columns
(artist, track and date.entered)
are variables that describe the song. Then we have 76 columns
(wk1-wk76) that describe the rank of the song
in each week1.
Here, the column names are one variable (the week) and
the cell values are another (the rank).
To tidy this data, we’ll use pivot_longer():
Start by reading the documentation:
?pivot_longer
Then here’s how it’ll apply to the billboard data
long.billboard <- billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank"
)
head(long.billboard, 10)
## # A tibble: 10 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk8 NA
## 9 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk9 NA
## 10 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk10 NA
As described in the documentation, there are three key arguments (aside from the data itself):
cols specifies which columns need to be pivoted,
i.e. which columns aren’t variables. This argument uses the same syntax
as select() so here we could use
!c(artist, track, date.entered) or
starts_with("wk"). The explanation point (!)
before the vector in select tells are what columsn to
exclude rather than includenames_to names the variable stored in the column names,
we named that variable weekvalues_to names the variable stored in the cell values,
we named that variable rankNote that in the code "week" and "rank" are
quoted because those are new variables we’re creating, they don’t yet
exist in the data when we run the pivot_longer() call.
Now let’s turn our attention to the resulting, longer data frame. What happens if a song is in the top 100 for less than 76 weeks?
Take 2 Pac’s “Baby Don’t Cry”, for example. (poke around for yourself
using View(long.billboard))…
The above output suggests that it was only in the top 100 for 7
weeks, and all the remaining weeks are filled in with missing values.
These NAs don’t really represent unknown observations; they
were forced to exist by the structure of the
long.billboard, so we can ask pivot_longer()
to get rid of them by setting values_drop_na = TRUE:
long.billboard <- billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
head(long.billboard, 10)
## # A tibble: 10 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
The number of rows is now much lower (check that by running
nrow() on both output dfs) – indicating that many rows with
NAs were dropped.
You might also wonder what happens if a song is in the top 100 for
more than 76 weeks? We can’t tell from this data, but you might guess
that additional columns wk77, wk78, … would be
added to the dataset.
This data is now tidy, but we could make future computation a bit
easier by converting values of week from character strings
to numbers using mutate() and
readr::parse_number(). parse_number() is a
handy function that will extract the first number from a string,
ignoring all other text.
billboard.longer.clean <- billboard %>%
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
) %>%
mutate(week = parse_number(week))
head(billboard.longer.clean, 10)
## # A tibble: 10 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <dbl> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 3 92
Now that we have all the week numbers in one variable and all the rank values in another, we’re in a good position to visualize how song ranks vary over time. Here’s some quick code to visualize that very few songs stay in the top 100 for more than 20 weeks.
billboard.longer.clean %>%
ggplot(aes(x = week, y = rank, group = track)) +
geom_line(alpha = 0.25) +
scale_y_reverse()
To gain some more intuition about what pivoting does to the data.
Let’s start with a very simple dataset to make things more transparent.
Suppose we have three patients with ids A, B, and C, and we
take two blood pressure measurements on each patient. We’ll create the
data with tribble(), a function for constructing small
tibbles by hand:
df <- tribble(
~id, ~bp1, ~bp2,
"A", 100, 120,
"B", 140, 115,
"C", 120, 125
)
We want our new dataset to have three variables: id
(already exists), measurement (the column names), and
value (the cell values). To achieve this, we need to pivot
the df longer:
df.long <- df %>%
pivot_longer(
cols = bp1:bp2,
names_to = "measurement",
values_to = "value"
)
df.long
## # A tibble: 6 × 3
## id measurement value
## <chr> <chr> <dbl>
## 1 A bp1 100
## 2 A bp2 120
## 3 B bp1 140
## 4 B bp2 115
## 5 C bp1 120
## 6 C bp2 125
Basically, the values in a column that was already a variable in the
original dataset (id) need to be repeated, once for each
column that is pivoted:
The column names become values in a new variable, whose name is
defined by names_to. They need to be repeated once for each
row in the original dataset:
The cell values also become values in a new variable, with a name
defined by values_to. They are unwound row by row:
A more challenging situation occurs when you have multiple pieces of
information crammed into the column names, and you would like to store
these in separate new variables. For example, take the who2
dataset, the source of table1 and friends that you saw
above:
head(who2, 12)
## # A tibble: 12 × 58
## country year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 1980 NA NA NA NA NA NA
## 2 Afghanistan 1981 NA NA NA NA NA NA
## 3 Afghanistan 1982 NA NA NA NA NA NA
## 4 Afghanistan 1983 NA NA NA NA NA NA
## 5 Afghanistan 1984 NA NA NA NA NA NA
## 6 Afghanistan 1985 NA NA NA NA NA NA
## 7 Afghanistan 1986 NA NA NA NA NA NA
## 8 Afghanistan 1987 NA NA NA NA NA NA
## 9 Afghanistan 1988 NA NA NA NA NA NA
## 10 Afghanistan 1989 NA NA NA NA NA NA
## 11 Afghanistan 1990 NA NA NA NA NA NA
## 12 Afghanistan 1991 NA NA NA NA NA NA
## # ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
## # sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
## # sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
## # sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
## # sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
## # sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>,
## # ep_m_1524 <dbl>, ep_m_2534 <dbl>, ep_m_3544 <dbl>, ep_m_4554 <dbl>, …
This dataset records information about tuberculosis diagnoses. There
are two columns that are already variables and are easy to interpret:
country and year. They are followed by 56(!)
columns like sp_m_014, ep_m_4554, and
rel_m_3544. If you stare at these columns for long enough,
you’ll notice there’s a pattern. Each column name is made up of three
pieces separated by _. The first piece,
sp/rel/ep, describes the method
used for the diagnosis, the second piece, m/f
is the gender (coded as a binary variable in this dataset), and the
third piece,
014/1524/2534/3544/4554/5564/65
is the age range (014 represents 0-14, for
example).
So in this case we have six pieces of information recorded in
who2: the country and the year
(already columns); the method of diagnosis, the gender category, and the
age range category (contained in the other column names); and the count
of patients in that category (cell values). To organize these six pieces
of information in six separate columns, we use
pivot_longer() with a vector of column
names for names_to and instructors for splitting the
original variable names into pieces for names_sep as well
as a column name for values_to:
who2.pivoted <- who2 %>%
pivot_longer(
cols = !(country:year),
names_to = c("diagnosis", "gender", "age"),
names_sep = "_",
values_to = "count"
)
head(who2.pivoted, 10)
## # A tibble: 10 × 6
## country year diagnosis gender age count
## <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 Afghanistan 1980 sp m 014 NA
## 2 Afghanistan 1980 sp m 1524 NA
## 3 Afghanistan 1980 sp m 2534 NA
## 4 Afghanistan 1980 sp m 3544 NA
## 5 Afghanistan 1980 sp m 4554 NA
## 6 Afghanistan 1980 sp m 5564 NA
## 7 Afghanistan 1980 sp m 65 NA
## 8 Afghanistan 1980 sp f 014 NA
## 9 Afghanistan 1980 sp f 1524 NA
## 10 Afghanistan 1980 sp f 2534 NA
(An alternative to names_sep is
names_pattern, which you can use to extract variables from
more complicated naming scenarios if you’re comfortable with Regular
Expressions)
Conceptually, this is only a minor variation on the simpler case you’ve already seen – instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it actually happens in a single step for efficiency.
While pivot_longer() is useful to solve the common class
of problems where values have ended up in column names, now let’s take a
look at pivot_wider(), which makes datasets, well…
wider by increasing columns and reducing rows and helps
when one observation is spread across multiple rows.
Here’s another mini dataset involving two patients with
ids A and B, we have three blood pressure measurements on
patient A and two on patient B:
df <- tribble(
~id, ~measurement, ~value,
"A", "bp1", 100,
"B", "bp1", 140,
"B", "bp2", 115,
"A", "bp2", 120,
"A", "bp3", 105
)
We’ll take the values from the value column and the
names from the measurement column:
df.wider <- df %>%
pivot_wider(
names_from = measurement,
values_from = value
)
df.wider
## # A tibble: 2 × 4
## id bp1 bp2 bp3
## <chr> <dbl> <dbl> <dbl>
## 1 A 100 120 105
## 2 B 140 115 NA
To begin the process pivot_wider() needs to first figure
out what will go in the rows and columns. The new column names will be
the unique values of measurement.
df %>%
distinct(measurement) %>%
pull()
## [1] "bp1" "bp2" "bp3"
By default, the rows in the output are determined by all the
variables that aren’t going into the new names or values. These are
called the id_cols. Here there is only one column, but in
general there can be any number.
df %>%
select(-measurement, -value) %>%
distinct()
## # A tibble: 2 × 1
## id
## <chr>
## 1 A
## 2 B
pivot_wider() then combines these results to generate an
empty data frame:
df %>%
select(-measurement, -value) %>%
distinct() %>%
mutate(x = NA, y = NA, z = NA)
## # A tibble: 2 × 4
## id x y z
## <chr> <lgl> <lgl> <lgl>
## 1 A NA NA NA
## 2 B NA NA NA
It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no third blood pressure measurement for patient B, so that cell remains missing.
One common annoyance is what happens if there are multiple rows in
the input that correspond to one cell in the output. The example below
has two rows that correspond to id “A” and measurement
“bp1”:
df <- tribble(
~id, ~measurement, ~value,
"A", "bp1", 100,
"A", "bp1", 102,
"A", "bp2", 120,
"B", "bp1", 140,
"B", "bp2", 115
)
If we attempt to pivot this we get an output that contains list-columns (which we really would rather not deal with… and thus a good sign that we should be organizing or identifying our data differently)
df %>%
pivot_wider(
names_from = measurement,
values_from = value
)
## Warning: Values from `value` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
## {data} |>
## dplyr::summarise(n = dplyr::n(), .by = c(id, measurement)) |>
## dplyr::filter(n > 1L)
## # A tibble: 2 × 3
## id bp1 bp2
## <chr> <list> <list>
## 1 A <dbl [2]> <dbl [1]>
## 2 B <dbl [1]> <dbl [1]>
Let’s try to do that together! Getting back
to where we started see if you can write a snippet to tidy-fy
table2 from above
(table3 actually can’t be converted
without using the separate_wider_delim()
function)
Let’s wrap up by working on a larger problem of wrangling and analysis together in Google Colab