Picking up from last time

Last time we were working on joins. A helpful tool for understanding joining syntax/semantics (whether in R or otherwise) is by some quick visualization…


Visualizing Joins

Let’s paint some pictures using really simple example in a hand-crafted tibble. In the below we’ll use a single key called key and a single value column (val_x and val_y), but the ideas all generalize to arbitrary numbers of keys and values.

x <- tribble(
  ~key, ~val_x,
     1, "x1",
     2, "x2",
     3, "x3"
)
y <- tribble(
  ~key, ~val_y,
     1, "y1",
     2, "y2",
     4, "y3"
)
Graphical representation of two simple tables. The colored key columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride..
Graphical representation of two simple tables. The colored key columns map background color to key value. The grey columns represent the “value” columns that are carried along for the ride..

To understand how joins work, it’s useful to think of every possible match. Here that’s shown with a grid of connecting lines

To describe a specific type of join, we indicate matches with dots. The matches determine the rows in the output, a new data frame that contains the key, the x values, and the y values.

For instance, below we show an innter join, where rows are retained only if and only if the keys are equal

We can apply the same principles to explain the outer joins, which keep observations that appear in at least one of the data frames. These joins work by adding an additional “virtual” observation to each data frame. This observation has a key that matches if no other key matches, and values filled with NA. There are three types of outer joins:

A little Practice

Let’s say we have the following snippet of data

participants <- tibble(
  id = c("P1", "P2", "P3", "P4"),
  age = c(22, 24, 20, 23)
)

sessions <- tibble(
  id = c("P2", "P3", "P5"),
  condition = c("A", "B", "A")
)

participants
## # A tibble: 4 × 2
##   id      age
##   <chr> <dbl>
## 1 P1       22
## 2 P2       24
## 3 P3       20
## 4 P4       23
sessions
## # A tibble: 3 × 2
##   id    condition
##   <chr> <chr>    
## 1 P2    A        
## 2 P3    B        
## 3 P5    A

Let’s try to do that together!


Q: See if you can figure out what will happen if you run the following code:

inner_join(participants, sessions, by = "id")

(Try to work it out before actually running it)


Q: How might that differ if you ran this instead?

left_join(participants, sessions, by = "id")



Pivots

Getting your data into the right format for the analysis you’d like to run can be a task all on its own.

Thankfully there are a few very common transformations that cover the vast majority of cases. So the goal of this part is to get you familiar with the main operation you’ll want to master: converting between wide and long representations of data.


Tidy data

You can represent the same underlying data in multiple ways.

The example below shows the same data organized in three different ways. Each dataset shows the same values of four variables: - country, - year, - population, - tuberculosis (i.e. the number of documented cases of TB)

But each dataset organizes the values in a different way.

table1
## # A tibble: 6 × 4
##   country      year  cases population
##   <chr>       <dbl>  <dbl>      <dbl>
## 1 Afghanistan  1999    745   19987071
## 2 Afghanistan  2000   2666   20595360
## 3 Brazil       1999  37737  172006362
## 4 Brazil       2000  80488  174504898
## 5 China        1999 212258 1272915272
## 6 China        2000 213766 1280428583
table2
## # A tibble: 12 × 4
##    country      year type            count
##    <chr>       <dbl> <chr>           <dbl>
##  1 Afghanistan  1999 cases             745
##  2 Afghanistan  1999 population   19987071
##  3 Afghanistan  2000 cases            2666
##  4 Afghanistan  2000 population   20595360
##  5 Brazil       1999 cases           37737
##  6 Brazil       1999 population  172006362
##  7 Brazil       2000 cases           80488
##  8 Brazil       2000 population  174504898
##  9 China        1999 cases          212258
## 10 China        1999 population 1272915272
## 11 China        2000 cases          213766
## 12 China        2000 population 1280428583
table3
## # A tibble: 6 × 3
##   country      year rate             
##   <chr>       <dbl> <chr>            
## 1 Afghanistan  1999 745/19987071     
## 2 Afghanistan  2000 2666/20595360    
## 3 Brazil       1999 37737/172006362  
## 4 Brazil       2000 80488/174504898  
## 5 China        1999 212258/1272915272
## 6 China        2000 213766/1280428583

Take a look at each of table{1,2,3}. Try and describe what the three formats of the different data are?

What is the mapping between them conceptually.

Tidy data

Just one of them, table1, will be much easier to work with for our purposes because it’s what we call tidy.

(We actually say this a couple weeks ago but it’s worth reiterating)

There are three interrelated rules that make a dataset tidy:

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.


Visual of tidy data rules
Visual of tidy data rules


Why this particular format?

Why ensure that your data is tidy? There are two main advantages:

  1. There’s a general advantage to picking one consistent way of storing data. If you have a consistent data structure, then your tools that work with it all share an underlying uniformity.
  2. This format is able to take advantage of the vectorized nature of R functions.

dplyr, ggplot2, and other useful packages are designed to work with tidy data. Here are a few small examples showing how you might work with table1.

From table1 it’s relatively straightforward to derive some things:

# Compute rate per 10,000
table1 %>%
  mutate(rate = cases / population * 10000)
## # A tibble: 6 × 5
##   country      year  cases population  rate
##   <chr>       <dbl>  <dbl>      <dbl> <dbl>
## 1 Afghanistan  1999    745   19987071 0.373
## 2 Afghanistan  2000   2666   20595360 1.29 
## 3 Brazil       1999  37737  172006362 2.19 
## 4 Brazil       2000  80488  174504898 4.61 
## 5 China        1999 212258 1272915272 1.67 
## 6 China        2000 213766 1280428583 1.67
# Or compute total cases per year
table1 %>% 
  group_by(year) %>% 
  summarize(total_cases = sum(cases))
## # A tibble: 2 × 2
##    year total_cases
##   <dbl>       <dbl>
## 1  1999      250740
## 2  2000      296920
# We can easily visualize this change over time

ggplot(table1, aes(x = year, y = cases)) +
  geom_line(aes(group = country), color = "grey50", linewidth= 2 ) +
  geom_point(aes(color = country, shape = country, size = 6)) +
  scale_x_continuous(breaks = c(1999, 2000)) + # x-axis breaks at 1999 and 2000 
  theme_bw()



Q: See if you can perform any of the same operations from table2

How would you calculate rate per 10,000 here?


Getting data into tidy format

Lots of data you’ll encounter is not like table1

Data is often organized to facilitate some goal other than analysis. For example, it’s common for data to be structured to make data entry or saving easy, rather than analysis.

This is likely what happens if you’ve run an online experiment (usign PCIbex) or similar.

This means that most real analyses will require at least a little tidying. You’ll begin by figuring out what the underlying variables and observations are. Sometimes this is easy; other times you’ll need to consult some documentation, or the the people who originally generated the data.

Next, you’ll pivot your data into a tidy form, with variables in the columns and observations in the rows.

tidyr provides two functions for pivoting data: 1. pivot_longer() 2. pivot_wider()

We’ll describe each in turn starting with pivot_longer()


Pivoting longer

Data in column names

The billboard dataset records the billboard rank of songs in the year 2000:

head(billboard)
## # A tibble: 6 × 79
##   artist      track date.entered   wk1   wk2   wk3   wk4   wk5   wk6   wk7   wk8
##   <chr>       <chr> <date>       <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac       Baby… 2000-02-26      87    82    72    77    87    94    99    NA
## 2 2Ge+her     The … 2000-09-02      91    87    92    NA    NA    NA    NA    NA
## 3 3 Doors Do… Kryp… 2000-04-08      81    70    68    67    66    57    54    53
## 4 3 Doors Do… Loser 2000-10-21      76    76    72    69    67    65    55    59
## 5 504 Boyz    Wobb… 2000-04-15      57    34    25    17    17    31    36    49
## 6 98^0        Give… 2000-08-19      51    39    34    26    26    19     2     2
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## #   wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## #   wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## #   wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## #   wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## #   wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>,
## #   wk43 <dbl>, wk44 <dbl>, wk45 <dbl>, wk46 <dbl>, wk47 <dbl>, wk48 <dbl>, …

In this dataset, each observation is a song. The first three columns (artist, track and date.entered) are variables that describe the song. Then we have 76 columns (wk1-wk76) that describe the rank of the song in each week1.

Here, the column names are one variable (the week) and the cell values are another (the rank).

To tidy this data, we’ll use pivot_longer():

Start by reading the documentation:

?pivot_longer

Then here’s how it’ll apply to the billboard data

long.billboard <- billboard %>% 
    pivot_longer(
      cols = starts_with("wk"), 
      names_to = "week", 
      values_to = "rank"
    )
head(long.billboard, 10)
## # A tibble: 10 × 5
##    artist track                   date.entered week   rank
##    <chr>  <chr>                   <date>       <chr> <dbl>
##  1 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk8      NA
##  9 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk9      NA
## 10 2 Pac  Baby Don't Cry (Keep... 2000-02-26   wk10     NA

As described in the documentation, there are three key arguments (aside from the data itself):

  • cols specifies which columns need to be pivoted, i.e. which columns aren’t variables. This argument uses the same syntax as select() so here we could use !c(artist, track, date.entered) or starts_with("wk"). The explanation point (!) before the vector in select tells are what columsn to exclude rather than include
  • names_to names the variable stored in the column names, we named that variable week
  • values_to names the variable stored in the cell values, we named that variable rank

Note that in the code "week" and "rank" are quoted because those are new variables we’re creating, they don’t yet exist in the data when we run the pivot_longer() call.


induced NAs

Now let’s turn our attention to the resulting, longer data frame. What happens if a song is in the top 100 for less than 76 weeks?

Take 2 Pac’s “Baby Don’t Cry”, for example. (poke around for yourself using View(long.billboard))…

The above output suggests that it was only in the top 100 for 7 weeks, and all the remaining weeks are filled in with missing values. These NAs don’t really represent unknown observations; they were forced to exist by the structure of the long.billboard, so we can ask pivot_longer() to get rid of them by setting values_drop_na = TRUE:

long.billboard <- billboard %>%
    pivot_longer(
      cols = starts_with("wk"), 
      names_to = "week", 
      values_to = "rank",
      values_drop_na = TRUE
    )
head(long.billboard, 10)
## # A tibble: 10 × 5
##    artist  track                   date.entered week   rank
##    <chr>   <chr>                   <date>       <chr> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk1      87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk2      82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk3      72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk4      77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk5      87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk6      94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26   wk7      99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02   wk1      91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02   wk2      87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02   wk3      92

The number of rows is now much lower (check that by running nrow() on both output dfs) – indicating that many rows with NAs were dropped.

You might also wonder what happens if a song is in the top 100 for more than 76 weeks? We can’t tell from this data, but you might guess that additional columns wk77, wk78, … would be added to the dataset.

This data is now tidy, but we could make future computation a bit easier by converting values of week from character strings to numbers using mutate() and readr::parse_number(). parse_number() is a handy function that will extract the first number from a string, ignoring all other text.

billboard.longer.clean <- billboard %>% 
    pivot_longer(
      cols = starts_with("wk"), 
      names_to = "week", 
      values_to = "rank",
      values_drop_na = TRUE
    ) %>% 
      mutate(week = parse_number(week))
head(billboard.longer.clean, 10)
## # A tibble: 10 × 5
##    artist  track                   date.entered  week  rank
##    <chr>   <chr>                   <date>       <dbl> <dbl>
##  1 2 Pac   Baby Don't Cry (Keep... 2000-02-26       1    87
##  2 2 Pac   Baby Don't Cry (Keep... 2000-02-26       2    82
##  3 2 Pac   Baby Don't Cry (Keep... 2000-02-26       3    72
##  4 2 Pac   Baby Don't Cry (Keep... 2000-02-26       4    77
##  5 2 Pac   Baby Don't Cry (Keep... 2000-02-26       5    87
##  6 2 Pac   Baby Don't Cry (Keep... 2000-02-26       6    94
##  7 2 Pac   Baby Don't Cry (Keep... 2000-02-26       7    99
##  8 2Ge+her The Hardest Part Of ... 2000-09-02       1    91
##  9 2Ge+her The Hardest Part Of ... 2000-09-02       2    87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02       3    92

Now that we have all the week numbers in one variable and all the rank values in another, we’re in a good position to visualize how song ranks vary over time. Here’s some quick code to visualize that very few songs stay in the top 100 for more than 20 weeks.

billboard.longer.clean %>% 
    ggplot(aes(x = week, y = rank, group = track)) + 
    geom_line(alpha = 0.25) + 
    scale_y_reverse()

How does that actually work though?

To gain some more intuition about what pivoting does to the data. Let’s start with a very simple dataset to make things more transparent. Suppose we have three patients with ids A, B, and C, and we take two blood pressure measurements on each patient. We’ll create the data with tribble(), a function for constructing small tibbles by hand:

df <- tribble(
  ~id,  ~bp1, ~bp2,
   "A",  100,  120,
   "B",  140,  115,
   "C",  120,  125
)

We want our new dataset to have three variables: id (already exists), measurement (the column names), and value (the cell values). To achieve this, we need to pivot the df longer:

df.long <- df %>% 
    pivot_longer(
      cols = bp1:bp2,
      names_to = "measurement",
      values_to = "value"
    )
df.long
## # A tibble: 6 × 3
##   id    measurement value
##   <chr> <chr>       <dbl>
## 1 A     bp1           100
## 2 A     bp2           120
## 3 B     bp1           140
## 4 B     bp2           115
## 5 C     bp1           120
## 6 C     bp2           125

Basically, the values in a column that was already a variable in the original dataset (id) need to be repeated, once for each column that is pivoted:

The column names become values in a new variable, whose name is defined by names_to. They need to be repeated once for each row in the original dataset:

The cell values also become values in a new variable, with a name defined by values_to. They are unwound row by row:


Many variables in column names

A more challenging situation occurs when you have multiple pieces of information crammed into the column names, and you would like to store these in separate new variables. For example, take the who2 dataset, the source of table1 and friends that you saw above:

head(who2, 12)
## # A tibble: 12 × 58
##    country      year sp_m_014 sp_m_1524 sp_m_2534 sp_m_3544 sp_m_4554 sp_m_5564
##    <chr>       <dbl>    <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Afghanistan  1980       NA        NA        NA        NA        NA        NA
##  2 Afghanistan  1981       NA        NA        NA        NA        NA        NA
##  3 Afghanistan  1982       NA        NA        NA        NA        NA        NA
##  4 Afghanistan  1983       NA        NA        NA        NA        NA        NA
##  5 Afghanistan  1984       NA        NA        NA        NA        NA        NA
##  6 Afghanistan  1985       NA        NA        NA        NA        NA        NA
##  7 Afghanistan  1986       NA        NA        NA        NA        NA        NA
##  8 Afghanistan  1987       NA        NA        NA        NA        NA        NA
##  9 Afghanistan  1988       NA        NA        NA        NA        NA        NA
## 10 Afghanistan  1989       NA        NA        NA        NA        NA        NA
## 11 Afghanistan  1990       NA        NA        NA        NA        NA        NA
## 12 Afghanistan  1991       NA        NA        NA        NA        NA        NA
## # ℹ 50 more variables: sp_m_65 <dbl>, sp_f_014 <dbl>, sp_f_1524 <dbl>,
## #   sp_f_2534 <dbl>, sp_f_3544 <dbl>, sp_f_4554 <dbl>, sp_f_5564 <dbl>,
## #   sp_f_65 <dbl>, sn_m_014 <dbl>, sn_m_1524 <dbl>, sn_m_2534 <dbl>,
## #   sn_m_3544 <dbl>, sn_m_4554 <dbl>, sn_m_5564 <dbl>, sn_m_65 <dbl>,
## #   sn_f_014 <dbl>, sn_f_1524 <dbl>, sn_f_2534 <dbl>, sn_f_3544 <dbl>,
## #   sn_f_4554 <dbl>, sn_f_5564 <dbl>, sn_f_65 <dbl>, ep_m_014 <dbl>,
## #   ep_m_1524 <dbl>, ep_m_2534 <dbl>, ep_m_3544 <dbl>, ep_m_4554 <dbl>, …

This dataset records information about tuberculosis diagnoses. There are two columns that are already variables and are easy to interpret: country and year. They are followed by 56(!) columns like sp_m_014, ep_m_4554, and rel_m_3544. If you stare at these columns for long enough, you’ll notice there’s a pattern. Each column name is made up of three pieces separated by _. The first piece, sp/rel/ep, describes the method used for the diagnosis, the second piece, m/f is the gender (coded as a binary variable in this dataset), and the third piece, 014/1524/2534/3544/4554/5564/65 is the age range (014 represents 0-14, for example).

So in this case we have six pieces of information recorded in who2: the country and the year (already columns); the method of diagnosis, the gender category, and the age range category (contained in the other column names); and the count of patients in that category (cell values). To organize these six pieces of information in six separate columns, we use pivot_longer() with a vector of column names for names_to and instructors for splitting the original variable names into pieces for names_sep as well as a column name for values_to:

who2.pivoted <- who2 %>% 
      pivot_longer(
        cols = !(country:year),
        names_to = c("diagnosis", "gender", "age"), 
        names_sep = "_",
        values_to = "count"
      )
head(who2.pivoted, 10)
## # A tibble: 10 × 6
##    country      year diagnosis gender age   count
##    <chr>       <dbl> <chr>     <chr>  <chr> <dbl>
##  1 Afghanistan  1980 sp        m      014      NA
##  2 Afghanistan  1980 sp        m      1524     NA
##  3 Afghanistan  1980 sp        m      2534     NA
##  4 Afghanistan  1980 sp        m      3544     NA
##  5 Afghanistan  1980 sp        m      4554     NA
##  6 Afghanistan  1980 sp        m      5564     NA
##  7 Afghanistan  1980 sp        m      65       NA
##  8 Afghanistan  1980 sp        f      014      NA
##  9 Afghanistan  1980 sp        f      1524     NA
## 10 Afghanistan  1980 sp        f      2534     NA

(An alternative to names_sep is names_pattern, which you can use to extract variables from more complicated naming scenarios if you’re comfortable with Regular Expressions)

Conceptually, this is only a minor variation on the simpler case you’ve already seen – instead of the column names pivoting into a single column, they pivot into multiple columns. You can imagine this happening in two steps (first pivoting and then separating) but under the hood it actually happens in a single step for efficiency.



Pivot wider

While pivot_longer() is useful to solve the common class of problems where values have ended up in column names, now let’s take a look at pivot_wider(), which makes datasets, well… wider by increasing columns and reducing rows and helps when one observation is spread across multiple rows.

Here’s another mini dataset involving two patients with ids A and B, we have three blood pressure measurements on patient A and two on patient B:

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "B",        "bp1",    140,
  "B",        "bp2",    115, 
  "A",        "bp2",    120,
  "A",        "bp3",    105
)

We’ll take the values from the value column and the names from the measurement column:

df.wider <-  df %>% 
    pivot_wider(
      names_from = measurement,
      values_from = value
    )
df.wider
## # A tibble: 2 × 4
##   id      bp1   bp2   bp3
##   <chr> <dbl> <dbl> <dbl>
## 1 A       100   120   105
## 2 B       140   115    NA

How does this work?

To begin the process pivot_wider() needs to first figure out what will go in the rows and columns. The new column names will be the unique values of measurement.

df %>%
  distinct(measurement) %>% 
  pull()
## [1] "bp1" "bp2" "bp3"

By default, the rows in the output are determined by all the variables that aren’t going into the new names or values. These are called the id_cols. Here there is only one column, but in general there can be any number.

df %>%
   select(-measurement, -value) %>% 
  distinct()
## # A tibble: 2 × 1
##   id   
##   <chr>
## 1 A    
## 2 B

pivot_wider() then combines these results to generate an empty data frame:

df %>%
   select(-measurement, -value) %>% 
  distinct() %>% 
  mutate(x = NA, y = NA, z = NA)
## # A tibble: 2 × 4
##   id    x     y     z    
##   <chr> <lgl> <lgl> <lgl>
## 1 A     NA    NA    NA   
## 2 B     NA    NA    NA

It then fills in all the missing values using the data in the input. In this case, not every cell in the output has a corresponding value in the input as there’s no third blood pressure measurement for patient B, so that cell remains missing.

One pitfall

One common annoyance is what happens if there are multiple rows in the input that correspond to one cell in the output. The example below has two rows that correspond to id “A” and measurement “bp1”:

df <- tribble(
  ~id, ~measurement, ~value,
  "A",        "bp1",    100,
  "A",        "bp1",    102,
  "A",        "bp2",    120,
  "B",        "bp1",    140, 
  "B",        "bp2",    115
)

If we attempt to pivot this we get an output that contains list-columns (which we really would rather not deal with… and thus a good sign that we should be organizing or identifying our data differently)

df %>%
  pivot_wider(
    names_from = measurement,
    values_from = value
  )
## Warning: Values from `value` are not uniquely identified; output will contain list-cols.
## • Use `values_fn = list` to suppress this warning.
## • Use `values_fn = {summary_fun}` to summarise duplicates.
## • Use the following dplyr code to identify duplicates.
##   {data} |>
##   dplyr::summarise(n = dplyr::n(), .by = c(id, measurement)) |>
##   dplyr::filter(n > 1L)
## # A tibble: 2 × 3
##   id    bp1       bp2      
##   <chr> <list>    <list>   
## 1 A     <dbl [2]> <dbl [1]>
## 2 B     <dbl [1]> <dbl [1]>


A little more practice

Let’s try to do that together! Getting back to where we started see if you can write a snippet to tidy-fy table2 from above

(table3 actually can’t be converted without using the separate_wider_delim() function)



A full problem in Colab

Let’s wrap up by working on a larger problem of wrangling and analysis together in Google Colab