Picking up from last time

We said last time that we’d work on joining dfs and pivots (longer vs. wider)… and we will… but first some review from HW2

Working with the data from CHT (2021)

Reading in the data to start

After you read in the data, you should filter to include only the trials from “Experiment 2”

Print out the number of rows you get in the original data and then in the Exp 2 only data (the latter should have 24948) trials

CHT.all.df <- read_tsv("https://spencercaplan.org/teaching/F25/Statistics/data/CHT_data_source.tsv", show_col_types = FALSE)
str(CHT.all.df)

## spc_tbl_ [43,416 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ uniqueid        : chr [1:43416] "debug09STXB:debugACSEQ7" "debug09STXB:debugACSEQ7" "debug09STXB:debugACSEQ7" "debug09STXB:debugACSEQ7" ...
##  $ VOT             : num [1:43416] 20 60 40 55 50 60 30 70 40 30 ...
##  $ blockTrialNum   : num [1:43416] 1 2 3 4 5 6 7 8 22 23 ...
##  $ condition       : num [1:43416] 0 0 0 0 0 0 0 0 0 0 ...
##  $ index           : num [1:43416] 1 1 2 1 2 2 1 1 1 1 ...
##  $ rt              : num [1:43416] 1320 682 698 568 414 ...
##  $ trialNum        : num [1:43416] 143 144 145 146 147 148 149 150 164 165 ...
##  $ experimentName  : chr [1:43416] "CHT Exp1 (in person)" "CHT Exp1 (in person)" "CHT Exp1 (in person)" "CHT Exp1 (in person)" ...
##  $ contextCondition: chr [1:43416] "L" "L" "L" "L" ...
##  $ ambigPhoneme    : chr [1:43416] "t" "t" "t" "t" ...
##  $ t_choice        : num [1:43416] 0 1 1 1 1 1 0 1 1 0 ...
##  $ blockNum        : num [1:43416] 1 1 1 1 1 1 1 1 2 2 ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   uniqueid = col_character(),
##   ..   VOT = col_double(),
##   ..   blockTrialNum = col_double(),
##   ..   condition = col_double(),
##   ..   index = col_double(),
##   ..   rt = col_double(),
##   ..   trialNum = col_double(),
##   ..   experimentName = col_character(),
##   ..   contextCondition = col_character(),
##   ..   ambigPhoneme = col_character(),
##   ..   t_choice = col_double(),
##   ..   blockNum = col_double()
##   .. )
##  - attr(*, "problems")=<externalptr>

CHT.exp2.df <- CHT.all.df %>% filter(experimentName == "CHT Exp2 (online)")
nrow(CHT.all.df)

## [1] 43416

nrow(CHT.exp2.df)

## [1] 24948

CHT.exp2.df$uniqueid <- as.factor(CHT.exp2.df$uniqueid)
summary(CHT.exp2.df)

##                                           uniqueid          VOT    
##  A10JXOU89D5RXR:36W0OB37HWEXSVOLO8IISUGDNCTZHJ:  162   Min.   :20  
##  A114BI2ZRYWKC0:3OCHAWUVGOKZPQPETBXU81GU7TMKXY:  162   1st Qu.:40  
##  A11Q8U6QTT8KGF:3U088ZLJVKTIN0DKFDRQNYNEKSO0WR:  162   Median :50  
##  A125AOX978LDG7:37U1UTWH9VMVXT11BNUZTELF879R8F:  162   Mean   :50  
##  A12E4Z8DBFUL29:324G5B4FB383XLCJ75JEVIOXOYV07W:  162   3rd Qu.:60  
##  A1561P9VVA3C1C:382M9COHEHF4MM39SKB4QZ4LSAGEUE:  162   Max.   :80  
##  (Other)                                      :23976               
##  blockTrialNum     condition         index           rt          
##  Min.   :  1.0   Min.   :0.000   Min.   :1.0   Min.   :     0.0  
##  1st Qu.: 41.0   1st Qu.:0.000   1st Qu.:1.0   1st Qu.:   174.9  
##  Median : 81.5   Median :1.000   Median :1.5   Median :   278.9  
##  Mean   : 81.5   Mean   :1.831   Mean   :1.5   Mean   :   396.2  
##  3rd Qu.:122.0   3rd Qu.:3.000   3rd Qu.:2.0   3rd Qu.:   421.8  
##  Max.   :162.0   Max.   :4.000   Max.   :2.0   Max.   :154615.2  
##                                                                  
##     trialNum     experimentName     contextCondition   ambigPhoneme      
##  Min.   :143.0   Length:24948       Length:24948       Length:24948      
##  1st Qu.:183.0   Class :character   Class :character   Class :character  
##  Median :223.5   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :223.5                                                           
##  3rd Qu.:264.0                                                           
##  Max.   :304.0                                                           
##                                                                          
##     t_choice         blockNum
##  Min.   :0.0000   Min.   :1  
##  1st Qu.:0.0000   1st Qu.:3  
##  Median :1.0000   Median :5  
##  Mean   :0.6193   Mean   :5  
##  3rd Qu.:1.0000   3rd Qu.:7  
##  Max.   :1.0000   Max.   :9  
##

# Or more specifically
is.factor(CHT.exp2.df$uniqueid)

## [1] TRUE

counts <- CHT.exp2.df %>% distinct(uniqueid, contextCondition, ambigPhoneme) %>% 
  group_by(contextCondition, ambigPhoneme) %>%
  summarise(n_subjects = n())

## `summarise()` has grouped output by 'contextCondition'. You can override using
## the `.groups` argument.

kable(counts)

contextCondition	ambigPhoneme	n_subjects
L	d	39
L	t	46
R	d	36
R	t	33

sum(counts$n_subjects)

## [1] 154

Now on to prepping the data frame for the main figure(s)

group.level.before <- CHT.exp2.df %>%
  filter(contextCondition == 'L') %>%
  group_by(ambigPhoneme, VOT)  %>%
  summarize(n = n(),
            n_t = sum(t_choice),
            p = mean(t_choice),
            se = sqrt(p * (1 - p) / n), # ,   # standard error
            ci_low = p - 1.96 * se,       # 95% lower bound
            ci_high = p + 1.96 * se       # 95% upper bound
            )

## `summarise()` has grouped output by 'ambigPhoneme'. You can override using the
## `.groups` argument.

group.level.after <- CHT.exp2.df %>%
  filter(contextCondition == 'R') %>%
  group_by(ambigPhoneme, VOT)  %>%
  summarize(n = n(),
            n_t = sum(t_choice),
            p = mean(t_choice),
            se = sqrt(p * (1 - p) / n), # ,   # standard error
            ci_low = p - 1.96 * se,       # 95% lower bound
            ci_high = p + 1.96 * se       # 95% upper bound
            )

## `summarise()` has grouped output by 'ambigPhoneme'. You can override using the
## `.groups` argument.

ambigPhoneme	VOT	n	n_t	p	se	ci_low	ci_high
d	20	702	10	0.0142450	0.0044725	0.0054790	0.0230111
d	30	702	51	0.0726496	0.0097965	0.0534485	0.0918507
d	40	702	198	0.2820513	0.0169841	0.2487625	0.3153401
d	45	702	345	0.4914530	0.0188685	0.4544707	0.5284353
d	50	702	492	0.7008547	0.0172817	0.6669826	0.7347268
d	55	702	583	0.8304843	0.0141613	0.8027282	0.8582404
d	60	702	635	0.9045584	0.0110897	0.8828227	0.9262942
d	70	702	680	0.9686610	0.0065760	0.9557721	0.9815499
d	80	702	695	0.9900285	0.0037500	0.9826784	0.9973786
t	20	828	10	0.0120773	0.0037960	0.0046371	0.0195175
t	30	828	111	0.1340580	0.0118407	0.1108503	0.1572656
t	40	828	325	0.3925121	0.0169699	0.3592510	0.4257731
t	45	828	475	0.5736715	0.0171865	0.5399859	0.6073571
t	50	828	624	0.7536232	0.0149748	0.7242725	0.7829739
t	55	828	739	0.8925121	0.0107640	0.8714147	0.9136094
t	60	828	775	0.9359903	0.0085064	0.9193179	0.9526628
t	70	828	814	0.9830918	0.0044805	0.9743099	0.9918737
t	80	828	818	0.9879227	0.0037960	0.9804825	0.9953629

ambigPhoneme	VOT	n	n_t	p	se	ci_low	ci_high
d	20	648	9	0.0138889	0.0045974	0.0048780	0.0228997
d	30	648	90	0.1388889	0.0135855	0.1122613	0.1655165
d	40	648	243	0.3750000	0.0190181	0.3377244	0.4122756
d	45	648	397	0.6126543	0.0191368	0.5751462	0.6501625
d	50	648	496	0.7654321	0.0166456	0.7328067	0.7980575
d	55	648	581	0.8966049	0.0119609	0.8731616	0.9200483
d	60	648	615	0.9490741	0.0086364	0.9321468	0.9660014
d	70	648	636	0.9814815	0.0052961	0.9711011	0.9918618
d	80	648	640	0.9876543	0.0043378	0.9791522	0.9961565
t	20	594	12	0.0202020	0.0057726	0.0088877	0.0315163
t	30	594	85	0.1430976	0.0143677	0.1149369	0.1712584
t	40	594	223	0.3754209	0.0198683	0.3364791	0.4143627
t	45	594	356	0.5993266	0.0201064	0.5599181	0.6387351
t	50	594	456	0.7676768	0.0173278	0.7337144	0.8016392
t	55	594	516	0.8686869	0.0138577	0.8415257	0.8958480
t	60	594	556	0.9360269	0.0100404	0.9163478	0.9557061
t	70	594	576	0.9696970	0.0070334	0.9559114	0.9834825
t	80	594	584	0.9831650	0.0052787	0.9728187	0.9935112

Actual plotting

CHT.before.pl <- ggplot(group.level.before,
             aes(x=VOT, y=p, color=ambigPhoneme, group=ambigPhoneme)) +
  geom_hline(yintercept=0,colour = "black",linewidth=1) +
  geom_line(linewidth=3) + 
  geom_errorbar(width=0.05, size=3, aes(ymin=ci_low, ymax=ci_high)) +
  geom_point(pch=21, fill='white', size = 3) +
  ylab('Proportion /t/ choice') +
  xlab('VOT (ms)') +
  ggtitle('CHT Text-Before') +
  FormSense_theme() +
  scale_fill_manual(values=c("#5e3c99", "#e66101"), aesthetics = c("color", "fill"), name = "Shifted\nPhone")

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

CHT.before.pl

# ggsave(plot = CHT.before.pl, "HW2-CHT-before.png",
#        width = 12, height = 8, units = "in")

CHT.after.pl <- ggplot(group.level.after,
             aes(x=VOT, y=p, color=ambigPhoneme, group=ambigPhoneme)) +
  geom_hline(yintercept=0,colour = "black",linewidth=1) +
  geom_line(linewidth=3) + 
  geom_errorbar(width=0.05, size=3, aes(ymin=ci_low, ymax=ci_high)) +
  geom_point(shape=21, fill='white', size = 3) +
  ylab('Proportion /t/ choice') +
  xlab('VOT (ms)') +
  ggtitle('CHT Text-After') +
  FormSense_theme() +
  scale_fill_manual(values=c("#5e3c99", "#e66101"), aesthetics = c("color", "fill"), name = "Shifted\nPhone")
CHT.after.pl

# ggsave(plot = CHT.after.pl, "HW2-CHT-after.png",
#        width = 12, height = 8, units = "in")

Simpler solution with facet_wrap

We hadn’t covered it yet at the time the assignment was released, but now that we know how to use facet_wrap we can actually produce the combined plot in one go!

group.level.all <- CHT.exp2.df %>%
  group_by(ambigPhoneme, contextCondition, VOT)  %>%
  summarize(n = n(),
            n_t = sum(t_choice),
            p = mean(t_choice),
            se = sqrt(p * (1 - p) / n), # ,   # standard error
            ci_low = p - 1.96 * se,       # 95% lower bound
            ci_high = p + 1.96 * se       # 95% upper bound
            )

## `summarise()` has grouped output by 'ambigPhoneme', 'contextCondition'. You can
## override using the `.groups` argument.

CHT.sidebyside.pl <- ggplot(group.level.all,
             aes(x=VOT, y=p, color=ambigPhoneme, group=ambigPhoneme)) +
  geom_hline(yintercept=0,colour = "black",linewidth=1) +
  geom_line(linewidth=3) + 
  geom_errorbar(width=0.05, size=3, aes(ymin=ci_low, ymax=ci_high)) +
  geom_point(pch=21, fill='white', size = 3) +
  facet_wrap(~contextCondition,
             labeller = labeller(contextCondition = c(
      "L" = "Text-Before",
      "R" = "Text-After"
    ))) + 
  ylab('Proportion /t/ choice') +
  xlab('VOT (ms)') +
  ggtitle('CHT Main Effect') +
  FormSense_theme() +
  scale_fill_manual(values=c("#5e3c99", "#e66101"), aesthetics = c("color", "fill"), name = "Shifted\nPhone")
CHT.sidebyside.pl

# ggsave(plot = CHT.before.pl, "HW2-CHT-before.png",
#        width = 12, height = 8, units = "in")

Optional Stretch Goal

Here are the numbers from the classic Pisoni & Tash (1974) paper, so first let’s see how we’d plot them

#### Pisoni & Tash (1974) plot
PT.extracted.data <- data.frame(
  VOT = c(0, 10, 20, 30, 40, 50, 60),
  rt = c(500, 494, 544, 580, 478, 481, 482) 
)

PT.extracted.data <- PT.extracted.data %>%
      mutate(logRT = log(rt)) %>%
      mutate(zRT = (logRT - mean(logRT)) / sd(logRT))

pisoni_style_plot <- ggplot(PT.extracted.data, aes(x = VOT, y = rt)) +
  geom_line(linetype = "dashed", color = "black", linewidth = 2) +
  geom_point(size = 5, color = "black", shape = 24, fill = "black") +
  labs(
    title = "Replotting Empirical RTs\nfrom Pisoni & Tash (1974)",
    x = "Raw VOT",
    y = "Mean RT  (ms)"
  ) +
  scale_x_continuous(breaks = seq(from = 0,
                                  to = 60,
                                  by = 10)) + 
  geom_vline(xintercept = 30, color = "steelblue", linetype = "dashed", linewidth = 1) + 
  single_pane_theme() +
  scale_y_continuous(limits = c(450, 600))
pisoni_style_plot

Let’s apply that basic outline here.

CHT.rts.to.plot <- CHT.exp2.df %>% 
          group_by(VOT) %>%
          summarise(meanRT = mean(rt), .groups = "drop")

pl.messy <- ggplot(data = CHT.rts.to.plot, aes(x = VOT, y = meanRT)) +
  geom_line(linetype = "dashed", color = "black", linewidth = 2) +
  geom_point(size = 5, color = "black", shape = 24, fill = "black") +
  labs(title = "Empirical RTs from CHT\nas a function of VOT level",
       y = "Raw RT",
       x = "Raw VOT",
       color = "Experiment / Condition") + 
  scale_x_continuous(breaks = seq(from = 10,
                                  to = 85,
                                  by = 15)) + 
  scale_color_viridis_d() +
  single_pane_theme() + 
  theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy

Cool!

This plot is just means, but we really would like some confidence intervals on each point to see visually whether this seems robust given the spread of the distribution.

Let’s try to do that together!

Hint: - We’re dealing with continuous rather than binomial trials now,… - Thus we can probably model this via a t-distribution - Recall the qt() function for calculating the qualtile function over the t-distribution

(In the markdown you’ll see there is a code cell that’s hidden DO NOT EXPAND this until you first try figuring it out yourself / collaboratively)

Don’t continue until you try this yourself!.

CHT.rts.to.plot <- CHT.exp2.df %>% 
          group_by(VOT) %>%
          summarise(n = n(),
                    meanRT = mean(rt),
                    se = sd(rt) / sqrt(n), # ,   # standard error
                    ci_low = meanRT - qt(0.975, df = n - 1) * se,       # 95% lower bound
                    ci_high = meanRT + qt(0.975, df = n - 1) * se       # 95% upper bound
                    ,  .groups = "drop"
                    )

pl.messy.withCIs <- ggplot(data = CHT.rts.to.plot, aes(x = VOT, y = meanRT)) +
  geom_line(linetype = "dashed", color = "black", linewidth = 2) +
  geom_point(size = 5, color = "black", shape = 24, fill = "black") +
  geom_errorbar(width=0.05, size=3, aes(ymin=ci_low, ymax=ci_high)) +
  labs(title = "Empirical RTs from CHT\nas a function of VOT level",
       y = "Raw RT",
       x = "Raw VOT",
       color = "Experiment / Condition") + 
  scale_x_continuous(breaks = seq(from = 10,
                                  to = 85,
                                  by = 15)) + 
  scale_color_viridis_d() +
  single_pane_theme() + 
  theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy.withCIs

Woah! Looks like there’s a huge spread over the RTs specifically at the VOT=40ms level.

Let’s see what could be going on:

## Seems like the spread is very wide at VOT=40
## Maybe worth exploring more:

# We'll filter to just the trials we want
CHT.exp2.df.VOT40 <- CHT.exp2.df %>% 
  filter(VOT == 40) %>%
  select(c(uniqueid, VOT, rt))

## Let's try plotting the distribution at the trial level:
ggplot(CHT.exp2.df.VOT40, aes(x = rt)) +
  geom_density(fill = "skyblue", alpha = 0.5)

Aha, seems like the culprit must be some really long trial where the participant never hit next, or perhaps this was misrecorded by the experimental software. Either way we should filter it out to see what the distribution looks like without this outlier:

## First confirming the outlier by checking the distribution
summary(CHT.exp2.df.VOT40$rt)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
##      1.33    182.00    296.02    465.93    470.00 154615.22

     # Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
     # 1.33    182.00    296.02    465.93    470.00 154615.22 

# Distribution looks like we'd expect except for the Max value (even the third quartile isn't so extreme)

CHT.exp2.df.VOT40.nooutlier <- CHT.exp2.df.VOT40 %>% filter(rt < 1000)
summary(CHT.exp2.df.VOT40.nooutlier$rt)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.325 174.025 283.567 319.380 418.226 997.220

ggplot(CHT.exp2.df.VOT40.nooutlier, aes(x = rt)) +
  geom_density(fill = "skyblue", alpha = 0.5)

That looks more reasonable. Let’s repeat the above workflow adding CIs but after removing these outliers.

CHT.rts.no.outliers.to.plot <- CHT.exp2.df %>% 
          filter(rt < 1000) %>%
          group_by(VOT) %>%
          summarise(n = n(),
                    meanRT = mean(rt),
                    se = sd(rt) / sqrt(n), # ,   # standard error
                    ci_low = meanRT - qt(0.975, df = n - 1) * se,       # 95% lower bound
                    ci_high = meanRT + qt(0.975, df = n - 1) * se       # 95% upper bound
                    ,  .groups = "drop"
                    )

pl.messy.withCIs.no.outliers <- ggplot(data = CHT.rts.no.outliers.to.plot, aes(x = VOT, y = meanRT)) +
  geom_line(linetype = "dashed", color = "black", linewidth = 2) +
  geom_point(size = 5, color = "black", shape = 24, fill = "black") +
  geom_errorbar(width=0.05, size=3, aes(ymin=ci_low, ymax=ci_high)) +
  labs(title = "Empirical RTs from CHT\nas a function of VOT level",
       y = "Raw RT",
       x = "Raw VOT",
       color = "Experiment / Condition") + 
  scale_x_continuous(breaks = seq(from = 10,
                                  to = 85,
                                  by = 15)) + 
  scale_color_viridis_d() +
  single_pane_theme() + 
  theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy.withCIs.no.outliers

Why did I mention this was cool again?

The other reason I described these results as “cool” is because: you’ll note that the original experiment wasn’t designed with processing speed in mind, nor was it a variable that was manipulated directly. Nevertheless, we see that the trend from Pisoni & Tash (1974) is replicated: participants are slowest to respect when judging input near the category boundary and faster to respond as the phonetic input moves away from the category boundary (in either direction).

Practicing a statistical test

Could we test this statistically? Let’s work on this together!

TRY TO WORK OUT TOGETHER HOW TO DO THIS :)

We haven’t actually covered testing differences between two samples yet… but how about we calculate the mean RT of the entire sample (not including the VOT=40ms level so that we don’t double count), and let’s (for now) pretend that is our “population”
Then treat VOT=40ms as our “sample” and test the null hypothesis that the non-40ms and 40ms RTs are the same

Exploring more

Let’s see what else is going on, if the slowdown is really based on the category boundary then we should see the RT peak move around as a result of perceptual learning for the participants whose categorization boundaries shifted.

Does that happen?

First let’s add a column to indicate which subjects had a shifted category boundary or not (more specifically than the text-before group, it’s actually most apparent in the text-before shifted-/d/ participants)

Then we’ll calculate the mean RT per VOT level again.

CHT.all.df.shifted <- CHT.exp2.df %>%
  mutate(shift_group = case_when(
    contextCondition == "L" & ambigPhoneme == 'd' ~ "CHT right-shift",
    TRUE ~ "CHT no-shift"
  ))

CHT.rts.to.plot <- CHT.all.df.shifted %>% 
          group_by(VOT, shift_group) %>%
          summarise(n = n(),
                    meanRT = mean(rt),
                    se = sd(rt) / sqrt(n), # ,   # standard error
                    ci_low = meanRT - qt(0.975, df = n - 1) * se,       # 95% lower bound
                    ci_high = meanRT + qt(0.975, df = n - 1) * se       # 95% upper bound
                    ,  .groups = "drop"
                    )

pl.messy.by.group <- ggplot(data = CHT.rts.to.plot, aes(x = VOT, y = meanRT, group = shift_group, color = shift_group)) +
  geom_line(linetype = "dashed", linewidth = 2) +
  geom_point(size = 5, shape = 24, fill = "white") +
  labs(title = "Empirical RTs from CHT",
       y = "Raw RT",
       x = "Raw VOT",
       color = "Experiment / Condition") + 
  scale_x_continuous(breaks = seq(from = min(CHT.all.df.shifted$VOT),
                                  to = max(CHT.all.df.shifted$VOT),
                                  by = 15)) + 
  scale_color_viridis_d() +
  single_pane_theme() + 
  theme(legend.position = "right") +
  theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy.by.group

There’s some overall difference in RTs by group… let’s z-score things so they’re more comparable

# Z-score the log-rts within each condition for comparability
CHT.all.df.shifted <- CHT.all.df.shifted %>% filter(rt > 0) %>%
          group_by(shift_group) %>%
          mutate(logRT = log(rt)) %>%
          mutate(zRT = (logRT - mean(logRT)) / sd(logRT)) %>%
          ungroup()

CHT.rts.normalized.to.plot <- CHT.all.df.shifted %>% 
          group_by(VOT, shift_group) %>%
          summarise(n = n(),
                    meanRT = mean(zRT),
                    se = sd(zRT) / sqrt(n), # ,   # standard error
                    ci_low = meanRT - qt(0.975, df = n - 1) * se,       # 95% lower bound
                    ci_high = meanRT + qt(0.975, df = n - 1) * se       # 95% upper bound
                    ,  .groups = "drop"
                    )


pl.normed.by.group <- ggplot(data = CHT.rts.normalized.to.plot, aes(x = VOT, y = meanRT, group = shift_group, color = shift_group)) +
  geom_line(linetype = "dashed", linewidth = 2) +
  geom_point(size = 5, shape = 24, fill = "white") +
  labs(title = "Empirical RTs from CHT",
       y = "Raw RT",
       x = "Raw VOT",
       color = "Experiment / Condition") + 
  scale_x_continuous(breaks = seq(from = min(CHT.all.df.shifted$VOT),
                                  to = max(CHT.all.df.shifted$VOT),
                                  by = 15)) + 
  scale_color_viridis_d() +
  single_pane_theme() + 
  theme(legend.position = "bottom") +
  theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.normed.by.group

Looks like it! (It’s just a boundary shift of about 5-10ms anyway…) But to really evaluate this difference we’d want to control for more factors… to be continued

In the meantime let’s look at some other data manipulations that will come in handy

Joining dfs

It’s rare that a data analysis involves only a single data frame. Typically you have multiple data frames, and you must join them together to answer the questions that you’re interested in.

Let’s work through two kinds of joins: - Mutating joins, which add new variables to one data frame from matching observations in another. - Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.

We’ll first by talk about keys, the variables used to connect a pair of data frames in a join.

Then we’ll practice things by looking at the keys in the datasets from the nycflights13 package, and use that knowledge to start joining data frames together.

library(tidyverse)
library(nycflights13)
# If you haven't previously you'll need to run:
# install.packages("nycflights13")

Keys

Every join involves a pair of keys: a primary key and a foreign key.

A primary key is a variable or set of variables that uniquely identifies each observation.

When more than one variable is needed, the key is called a compound key. For example, in nycflights13:

airlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two letter carrier code, making carrier the primary key.

airlines

## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.

airports records data about each airport. You can identify each airport by its three letter airport code, making faa the primary key.

airports

## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # ℹ 1,448 more rows

planes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.

planes

## # A tibble: 3,322 × 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi… EMBRAER      EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi… AIRBUS INDU… A320…       2   182    NA Turbo…
## # ℹ 3,312 more rows

weather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.

weather

## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>

A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example: - flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum. - flights$carrier is a foreign key that corresponds to the primary key airlines$carrier. - flights$origin is a foreign key that corresponds to the primary key airports$faa. - flights$dest is a foreign key that corresponds to the primary key airports$faa. - flights$origin-flights$time_hour is a compound foreign key that corresponds to the compound primary key weather$origin-weather$time_hour.

Connections between all five data frames in the nycflights13 package. Variables making up a primary key are colored grey, and are connected to their corresponding foreign keys with arrows.

You’ll notice a nice feature in the design of these particular keys: the primary and foreign keys almost always have the same names, which, as you’ll see shortly, will make your joining life much easier. It’s also worth noting the opposite relationship: almost every variable name used in multiple tables has the same meaning in each place. There’s only one exception: year means year of departure in flights and year manufactured in planes. This will become important when we start actually joining tables together.

Checking primary keys

Now that that we’ve identified the primary keys in each table, it’s good practice to verify that they do indeed uniquely identify each observation. One way to do that is to count() the primary keys and look for entries where n is greater than one. This reveals that planes and weather both look good:

planes %>% 
  count(tailnum) %>%
  filter(n > 1)

## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>

weather %>% 
  count(time_hour, origin) %>%
  filter(n > 1)

## # A tibble: 0 × 3
## # ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>

Basic Joins

Now that we understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions: - left_join() - inner_join() - right_join() - full_join() - semi_join()

We’ll cover these in turn.

Mutating joins

A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like mutate(), the join functions add variables to the right, so if your dataset has many variables, you won’t see the new ones…. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables:

flights.simple <- flights %>% 
  select(year, time_hour, origin, dest, tailnum, carrier)
flights.simple

## # A tibble: 336,776 × 6
##     year time_hour           origin dest  tailnum carrier
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>  
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA     
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA     
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA     
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6     
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL     
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA     
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6     
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV     
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6     
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA     
## # ℹ 336,766 more rows

There are technically four types of mutating join, but there’s really just one that you’ll use almost all of the time: left_join(). It’s special because the output will always have the same rows as the data frame you’re joining to.

The primary use of left_join() is to add in additional metadata. For example, we can use left_join() to add the full airline name to the flights.simple data:

flights.simple %>%
  left_join(airlines)

## Joining with `by = join_by(carrier)`

## # A tibble: 336,776 × 7
##     year time_hour           origin dest  tailnum carrier name                  
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>                 
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      United Air Lines Inc. 
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      United Air Lines Inc. 
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      American Airlines Inc.
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      JetBlue Airways       
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Delta Air Lines Inc.  
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      United Air Lines Inc. 
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      JetBlue Airways       
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      ExpressJet Airlines I…
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      JetBlue Airways       
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      American Airlines Inc.
## # ℹ 336,766 more rows

Or we could find out the temperature and wind speed when each plane departed:

flights.simple %>%
    left_join(weather %>% select(origin, time_hour, temp, wind_speed))

## Joining with `by = join_by(time_hour, origin)`

## # A tibble: 336,776 × 8
##     year time_hour           origin dest  tailnum carrier  temp wind_speed
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <dbl>      <dbl>
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA       39.0       12.7
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA       39.9       15.0
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA       39.0       15.0
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6       39.0       15.0
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL       39.9       16.1
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA       39.0       12.7
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6       37.9       11.5
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV       39.9       16.1
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6       37.9       13.8
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA       39.9       16.1
## # ℹ 336,766 more rows

Or what size of plane was flying:

flights.simple %>%
    left_join(planes %>% select(tailnum, type, engines, seats))

## Joining with `by = join_by(tailnum)`

## # A tibble: 336,776 × 9
##     year time_hour           origin dest  tailnum carrier type     engines seats
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <int> <int>
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      Fixed w…       2   149
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      Fixed w…       2   149
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Fixed w…       2   178
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      Fixed w…       2   200
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Fixed w…       2   178
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Fixed w…       2   191
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      Fixed w…       2   200
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      Fixed w…       2    55
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      Fixed w…       2   200
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>          NA    NA
## # ℹ 336,766 more rows

If/when left_join() fails to find a match for a row in x (the data frame you’re joining to), it fills in the new variables with missing values.

For example, there’s no information about the plane with tail number N3ALAA so the type, engines, and seats will be missing:

flights.simple %>%
  filter(tailnum == "N3ALAA") %>% 
  left_join(planes %>% select(tailnum, type, engines, seats))

## Joining with `by = join_by(tailnum)`

## # A tibble: 63 × 9
##     year time_hour           origin dest  tailnum carrier type  engines seats
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>   <int> <int>
##  1  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
##  2  2013 2013-01-02 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
##  3  2013 2013-01-03 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
##  4  2013 2013-01-07 19:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
##  5  2013 2013-01-08 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
##  6  2013 2013-01-16 06:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
##  7  2013 2013-01-20 18:00:00 LGA    ORD   N3ALAA  AA      <NA>       NA    NA
##  8  2013 2013-01-22 17:00:00 JFK    ORD   N3ALAA  AA      <NA>       NA    NA
##  9  2013 2013-10-11 06:00:00 EWR    MIA   N3ALAA  AA      <NA>       NA    NA
## 10  2013 2013-10-14 08:00:00 JFK    BOS   N3ALAA  AA      <NA>       NA    NA
## # ℹ 53 more rows

(This is something to be careful with)

Specifiying join keys

By default, left_join() will use all variables that appear in both data frames as the join key, the so called natural join. This is a useful heuristic, but it doesn’t always work. For example, what happens if we try to join flights.simple with the complete planes dataset?

flights.simple %>%
  left_join(planes)

## Joining with `by = join_by(year, tailnum)`

## # A tibble: 336,776 × 13
##     year time_hour           origin dest  tailnum carrier type  manufacturer
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr> <chr>       
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      <NA>  <NA>        
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      <NA>  <NA>        
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      <NA>  <NA>        
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>  <NA>        
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      <NA>  <NA>        
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      <NA>  <NA>        
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      <NA>  <NA>        
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      <NA>  <NA>        
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      <NA>  <NA>        
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      <NA>  <NA>        
## # ℹ 336,766 more rows
## # ℹ 5 more variables: model <chr>, engines <int>, seats <int>, speed <int>,
## #   engine <chr>

We get a lot of missing matches because our join is trying to use tailnum and year as a compound key. Both flights and planes have a year column but they mean different things: flights$year is the year the flight occurred and planes$year is the year the plane was built.

We only want to join on tailnum so we need to provide an explicit specification with join_by():

flights.simple %>% 
  left_join(planes, join_by(tailnum))

## # A tibble: 336,776 × 14
##    year.x time_hour           origin dest  tailnum carrier year.y type          
##     <int> <dttm>              <chr>  <chr> <chr>   <chr>    <int> <chr>         
##  1   2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA        1999 Fixed wing mu…
##  2   2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA        1998 Fixed wing mu…
##  3   2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA        1990 Fixed wing mu…
##  4   2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6        2012 Fixed wing mu…
##  5   2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL        1991 Fixed wing mu…
##  6   2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA        2012 Fixed wing mu…
##  7   2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6        2000 Fixed wing mu…
##  8   2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV        1998 Fixed wing mu…
##  9   2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6        2004 Fixed wing mu…
## 10   2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA          NA <NA>          
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## #   seats <int>, speed <int>, engine <chr>

Note: the year variables get disambiguated in the output with a suffix (year.x and year.y), which tells you whether the variable came from the x (joining to) or y (joining from) argument. You can override the default suffixes with the suffix argument.

Under the hood join_by(tailnum) is really shorthand for join_by(tailnum == tailnum). It’s important to know about this fuller form for a couple reasons. - First, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is sometimes called an equi join. - Second, it’s how you specify different join keys in each table. For example, there are two ways to join the flights.simple and airports table: either by dest or origin:

flights.simple %>% 
  left_join(airports, join_by(dest == faa))

## # A tibble: 336,776 × 13
##     year time_hour           origin dest  tailnum carrier name         lat   lon
##    <int> <dttm>              <chr>  <chr> <chr>   <chr>   <chr>      <dbl> <dbl>
##  1  2013 2013-01-01 05:00:00 EWR    IAH   N14228  UA      George Bu…  30.0 -95.3
##  2  2013 2013-01-01 05:00:00 LGA    IAH   N24211  UA      George Bu…  30.0 -95.3
##  3  2013 2013-01-01 05:00:00 JFK    MIA   N619AA  AA      Miami Intl  25.8 -80.3
##  4  2013 2013-01-01 05:00:00 JFK    BQN   N804JB  B6      <NA>        NA    NA  
##  5  2013 2013-01-01 06:00:00 LGA    ATL   N668DN  DL      Hartsfiel…  33.6 -84.4
##  6  2013 2013-01-01 05:00:00 EWR    ORD   N39463  UA      Chicago O…  42.0 -87.9
##  7  2013 2013-01-01 06:00:00 EWR    FLL   N516JB  B6      Fort Laud…  26.1 -80.2
##  8  2013 2013-01-01 06:00:00 LGA    IAD   N829AS  EV      Washingto…  38.9 -77.5
##  9  2013 2013-01-01 06:00:00 JFK    MCO   N593JB  B6      Orlando I…  28.4 -81.3
## 10  2013 2013-01-01 06:00:00 LGA    ORD   N3ALAA  AA      Chicago O…  42.0 -87.9
## # ℹ 336,766 more rows
## # ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>

The other three mutating joins (inner_join(), right_join(), full_join()) all have the same interface as left_join(). The difference is which rows they keep: left join keeps all the rows in x, the right join keeps all rows in y, the full join keeps all rows in either x or y, and the inner join only keeps rows that occur in both x and y. We’ll come back to these in more detail later.

Filtering joins

As you might guess from the name the primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins.

Semi-joins keep all rows in x that have a match in y.

For example, we could use a semi-join to filter the airports dataset to show just the origin airports:

airports %>% 
  semi_join(flights.simple, join_by(faa == origin))

## # A tibble: 3 × 8
##   faa   name                  lat   lon   alt    tz dst   tzone           
##   <chr> <chr>               <dbl> <dbl> <dbl> <dbl> <chr> <chr>           
## 1 EWR   Newark Liberty Intl  40.7 -74.2    18    -5 A     America/New_York
## 2 JFK   John F Kennedy Intl  40.6 -73.8    13    -5 A     America/New_York
## 3 LGA   La Guardia           40.8 -73.9    22    -5 A     America/New_York

Or perhaps just the destinations:

airports %>% 
  semi_join(flights.simple, join_by(faa == dest))

## # A tibble: 101 × 8
##    faa   name                                lat    lon   alt    tz dst   tzone 
##    <chr> <chr>                             <dbl>  <dbl> <dbl> <dbl> <chr> <chr> 
##  1 ABQ   Albuquerque International Sunport  35.0 -107.   5355    -7 A     Ameri…
##  2 ACK   Nantucket Mem                      41.3  -70.1    48    -5 A     Ameri…
##  3 ALB   Albany Intl                        42.7  -73.8   285    -5 A     Ameri…
##  4 ANC   Ted Stevens Anchorage Intl         61.2 -150.    152    -9 A     Ameri…
##  5 ATL   Hartsfield Jackson Atlanta Intl    33.6  -84.4  1026    -5 A     Ameri…
##  6 AUS   Austin Bergstrom Intl              30.2  -97.7   542    -6 A     Ameri…
##  7 AVL   Asheville Regional Airport         35.4  -82.5  2165    -5 A     Ameri…
##  8 BDL   Bradley Intl                       41.9  -72.7   173    -5 A     Ameri…
##  9 BGR   Bangor Intl                        44.8  -68.8   192    -5 A     Ameri…
## 10 BHM   Birmingham Intl                    33.6  -86.8   644    -6 A     Ameri…
## # ℹ 91 more rows

Anti-joins are the opposite: they return all rows in x that don’t have a match in y. These are useful for finding missing values that are implicit in the data (since implicitly missing values don’t show up as NAs but instead only exist as an absence)

For example, we can find rows that are missing from airports by looking for flights that don’t have a matching destination airport:

flights.simple %>% 
  anti_join(airports, join_by(dest == faa)) %>% 
  distinct(dest)

## # A tibble: 4 × 1
##   dest 
##   <chr>
## 1 BQN  
## 2 SJU  
## 3 STT  
## 4 PSE

What’s next

Plenty more we could talk about but let’s leave pivoting (and visualizing joins) for next time…

Have a nice weekend!

P6 (Data wrangling – Part II)

Spencer Caplan

2025-10-23