We said last time that we’d work on joining data frames and pivots (longer vs. wider)… and we will… but first, some review from HW2
After you read in the data, you should filter to include only the trials from “Experiment 2”
Print out the number of rows you get in the original data and then in the Exp 2 only data (the latter should have 24,948 trials)
CHT.all.df <- read_tsv("https://spencercaplan.org/teaching/F25/Statistics/data/CHT_data_source.tsv", show_col_types = FALSE)
str(CHT.all.df)
## spc_tbl_ [43,416 × 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ uniqueid : chr [1:43416] "debug09STXB:debugACSEQ7" "debug09STXB:debugACSEQ7" "debug09STXB:debugACSEQ7" "debug09STXB:debugACSEQ7" ...
## $ VOT : num [1:43416] 20 60 40 55 50 60 30 70 40 30 ...
## $ blockTrialNum : num [1:43416] 1 2 3 4 5 6 7 8 22 23 ...
## $ condition : num [1:43416] 0 0 0 0 0 0 0 0 0 0 ...
## $ index : num [1:43416] 1 1 2 1 2 2 1 1 1 1 ...
## $ rt : num [1:43416] 1320 682 698 568 414 ...
## $ trialNum : num [1:43416] 143 144 145 146 147 148 149 150 164 165 ...
## $ experimentName : chr [1:43416] "CHT Exp1 (in person)" "CHT Exp1 (in person)" "CHT Exp1 (in person)" "CHT Exp1 (in person)" ...
## $ contextCondition: chr [1:43416] "L" "L" "L" "L" ...
## $ ambigPhoneme : chr [1:43416] "t" "t" "t" "t" ...
## $ t_choice : num [1:43416] 0 1 1 1 1 1 0 1 1 0 ...
## $ blockNum : num [1:43416] 1 1 1 1 1 1 1 1 2 2 ...
## - attr(*, "spec")=
## .. cols(
## .. uniqueid = col_character(),
## .. VOT = col_double(),
## .. blockTrialNum = col_double(),
## .. condition = col_double(),
## .. index = col_double(),
## .. rt = col_double(),
## .. trialNum = col_double(),
## .. experimentName = col_character(),
## .. contextCondition = col_character(),
## .. ambigPhoneme = col_character(),
## .. t_choice = col_double(),
## .. blockNum = col_double()
## .. )
## - attr(*, "problems")=<externalptr>
CHT.exp2.df <- CHT.all.df %>% filter(experimentName == "CHT Exp2 (online)")
nrow(CHT.all.df)
## [1] 43416
nrow(CHT.exp2.df)
## [1] 24948
CHT.exp2.df$uniqueid <- as.factor(CHT.exp2.df$uniqueid)
summary(CHT.exp2.df)
## uniqueid VOT
## A10JXOU89D5RXR:36W0OB37HWEXSVOLO8IISUGDNCTZHJ: 162 Min. :20
## A114BI2ZRYWKC0:3OCHAWUVGOKZPQPETBXU81GU7TMKXY: 162 1st Qu.:40
## A11Q8U6QTT8KGF:3U088ZLJVKTIN0DKFDRQNYNEKSO0WR: 162 Median :50
## A125AOX978LDG7:37U1UTWH9VMVXT11BNUZTELF879R8F: 162 Mean :50
## A12E4Z8DBFUL29:324G5B4FB383XLCJ75JEVIOXOYV07W: 162 3rd Qu.:60
## A1561P9VVA3C1C:382M9COHEHF4MM39SKB4QZ4LSAGEUE: 162 Max. :80
## (Other) :23976
## blockTrialNum condition index rt
## Min. : 1.0 Min. :0.000 Min. :1.0 Min. : 0.0
## 1st Qu.: 41.0 1st Qu.:0.000 1st Qu.:1.0 1st Qu.: 174.9
## Median : 81.5 Median :1.000 Median :1.5 Median : 278.9
## Mean : 81.5 Mean :1.831 Mean :1.5 Mean : 396.2
## 3rd Qu.:122.0 3rd Qu.:3.000 3rd Qu.:2.0 3rd Qu.: 421.8
## Max. :162.0 Max. :4.000 Max. :2.0 Max. :154615.2
##
## trialNum experimentName contextCondition ambigPhoneme
## Min. :143.0 Length:24948 Length:24948 Length:24948
## 1st Qu.:183.0 Class :character Class :character Class :character
## Median :223.5 Mode :character Mode :character Mode :character
## Mean :223.5
## 3rd Qu.:264.0
## Max. :304.0
##
## t_choice blockNum
## Min. :0.0000 Min. :1
## 1st Qu.:0.0000 1st Qu.:3
## Median :1.0000 Median :5
## Mean :0.6193 Mean :5
## 3rd Qu.:1.0000 3rd Qu.:7
## Max. :1.0000 Max. :9
##
# Or more specifically
is.factor(CHT.exp2.df$uniqueid)
## [1] TRUE
counts <- CHT.exp2.df %>% distinct(uniqueid, contextCondition, ambigPhoneme) %>%
group_by(contextCondition, ambigPhoneme) %>%
summarise(n_subjects = n())
## `summarise()` has grouped output by 'contextCondition'. You can override using
## the `.groups` argument.
kable(counts)
| contextCondition | ambigPhoneme | n_subjects |
|---|---|---|
| L | d | 39 |
| L | t | 46 |
| R | d | 36 |
| R | t | 33 |
sum(counts$n_subjects)
## [1] 154
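As a quick consistency check: the summary above showed 162 trials per participant (and blockTrialNum tops out at 162), so 154 subjects should give us exactly the row count we saw after filtering.
154 * 162 # = 24948, matching nrow(CHT.exp2.df)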
group.level.before <- CHT.exp2.df %>%
filter(contextCondition == 'L') %>%
group_by(ambigPhoneme, VOT) %>%
summarize(n = n(),
n_t = sum(t_choice),
p = mean(t_choice),
se = sqrt(p * (1 - p) / n), # standard error
ci_low = p - 1.96 * se, # 95% lower bound
ci_high = p + 1.96 * se # 95% upper bound
)
## `summarise()` has grouped output by 'ambigPhoneme'. You can override using the
## `.groups` argument.
group.level.after <- CHT.exp2.df %>%
filter(contextCondition == 'R') %>%
group_by(ambigPhoneme, VOT) %>%
summarize(n = n(),
n_t = sum(t_choice),
p = mean(t_choice),
se = sqrt(p * (1 - p) / n), # standard error
ci_low = p - 1.96 * se, # 95% lower bound
ci_high = p + 1.96 * se # 95% upper bound
)
## `summarise()` has grouped output by 'ambigPhoneme'. You can override using the
## `.groups` argument.
As a bonus, since we’re re-using the code above, it would be nice to pull it into a separate helper function that we can call for both the L and R conditions.
(Though I’m not expecting you to write your own functions like this…)
means_and_CIs_for_plotting_CHT <- function(data) {
  processed.df <- data %>%
    group_by(ambigPhoneme, VOT) %>%
    summarize(n = n(),
              n_t = sum(t_choice),
              p = mean(t_choice),
              se = sqrt(p * (1 - p) / n), # standard error
              ci_low = p - 1.96 * se, # 95% lower bound
              ci_high = p + 1.96 * se # 95% upper bound
    )
  return(processed.df)
}
group.level.before <- means_and_CIs_for_plotting_CHT(CHT.exp2.df %>%
filter(contextCondition == 'L'))
## `summarise()` has grouped output by 'ambigPhoneme'. You can override using the
## `.groups` argument.
group.level.after <- means_and_CIs_for_plotting_CHT(CHT.exp2.df %>%
filter(contextCondition == 'R'))
## `summarise()` has grouped output by 'ambigPhoneme'. You can override using the
## `.groups` argument.
The above two code blocks will do the same thing. Let’s view the results below:
kable(group.level.before)
| ambigPhoneme | VOT | n | n_t | p | se | ci_low | ci_high |
|---|---|---|---|---|---|---|---|
| d | 20 | 702 | 10 | 0.0142450 | 0.0044725 | 0.0054790 | 0.0230111 |
| d | 30 | 702 | 51 | 0.0726496 | 0.0097965 | 0.0534485 | 0.0918507 |
| d | 40 | 702 | 198 | 0.2820513 | 0.0169841 | 0.2487625 | 0.3153401 |
| d | 45 | 702 | 345 | 0.4914530 | 0.0188685 | 0.4544707 | 0.5284353 |
| d | 50 | 702 | 492 | 0.7008547 | 0.0172817 | 0.6669826 | 0.7347268 |
| d | 55 | 702 | 583 | 0.8304843 | 0.0141613 | 0.8027282 | 0.8582404 |
| d | 60 | 702 | 635 | 0.9045584 | 0.0110897 | 0.8828227 | 0.9262942 |
| d | 70 | 702 | 680 | 0.9686610 | 0.0065760 | 0.9557721 | 0.9815499 |
| d | 80 | 702 | 695 | 0.9900285 | 0.0037500 | 0.9826784 | 0.9973786 |
| t | 20 | 828 | 10 | 0.0120773 | 0.0037960 | 0.0046371 | 0.0195175 |
| t | 30 | 828 | 111 | 0.1340580 | 0.0118407 | 0.1108503 | 0.1572656 |
| t | 40 | 828 | 325 | 0.3925121 | 0.0169699 | 0.3592510 | 0.4257731 |
| t | 45 | 828 | 475 | 0.5736715 | 0.0171865 | 0.5399859 | 0.6073571 |
| t | 50 | 828 | 624 | 0.7536232 | 0.0149748 | 0.7242725 | 0.7829739 |
| t | 55 | 828 | 739 | 0.8925121 | 0.0107640 | 0.8714147 | 0.9136094 |
| t | 60 | 828 | 775 | 0.9359903 | 0.0085064 | 0.9193179 | 0.9526628 |
| t | 70 | 828 | 814 | 0.9830918 | 0.0044805 | 0.9743099 | 0.9918737 |
| t | 80 | 828 | 818 | 0.9879227 | 0.0037960 | 0.9804825 | 0.9953629 |
kable(group.level.after)
| ambigPhoneme | VOT | n | n_t | p | se | ci_low | ci_high |
|---|---|---|---|---|---|---|---|
| d | 20 | 648 | 9 | 0.0138889 | 0.0045974 | 0.0048780 | 0.0228997 |
| d | 30 | 648 | 90 | 0.1388889 | 0.0135855 | 0.1122613 | 0.1655165 |
| d | 40 | 648 | 243 | 0.3750000 | 0.0190181 | 0.3377244 | 0.4122756 |
| d | 45 | 648 | 397 | 0.6126543 | 0.0191368 | 0.5751462 | 0.6501625 |
| d | 50 | 648 | 496 | 0.7654321 | 0.0166456 | 0.7328067 | 0.7980575 |
| d | 55 | 648 | 581 | 0.8966049 | 0.0119609 | 0.8731616 | 0.9200483 |
| d | 60 | 648 | 615 | 0.9490741 | 0.0086364 | 0.9321468 | 0.9660014 |
| d | 70 | 648 | 636 | 0.9814815 | 0.0052961 | 0.9711011 | 0.9918618 |
| d | 80 | 648 | 640 | 0.9876543 | 0.0043378 | 0.9791522 | 0.9961565 |
| t | 20 | 594 | 12 | 0.0202020 | 0.0057726 | 0.0088877 | 0.0315163 |
| t | 30 | 594 | 85 | 0.1430976 | 0.0143677 | 0.1149369 | 0.1712584 |
| t | 40 | 594 | 223 | 0.3754209 | 0.0198683 | 0.3364791 | 0.4143627 |
| t | 45 | 594 | 356 | 0.5993266 | 0.0201064 | 0.5599181 | 0.6387351 |
| t | 50 | 594 | 456 | 0.7676768 | 0.0173278 | 0.7337144 | 0.8016392 |
| t | 55 | 594 | 516 | 0.8686869 | 0.0138577 | 0.8415257 | 0.8958480 |
| t | 60 | 594 | 556 | 0.9360269 | 0.0100404 | 0.9163478 | 0.9557061 |
| t | 70 | 594 | 576 | 0.9696970 | 0.0070334 | 0.9559114 | 0.9834825 |
| t | 80 | 594 | 584 | 0.9831650 | 0.0052787 | 0.9728187 | 0.9935112 |
CHT.before.pl <- ggplot(group.level.before,
aes(x=VOT, y=p, color=ambigPhoneme, group=ambigPhoneme)) +
geom_hline(yintercept=0,colour = "black",linewidth=1) +
geom_line(linewidth=3) +
geom_errorbar(width=0.05, linewidth=3, aes(ymin=ci_low, ymax=ci_high)) +
geom_point(pch=21, fill='white', size = 3) +
ylab('Proportion /t/ choice') +
xlab('VOT (ms)') +
ggtitle('CHT Text-Before') +
FormSense_theme() +
scale_fill_manual(values=c("#5e3c99", "#e66101"), aesthetics = c("color", "fill"), name = "Shifted\nPhone")
## Warning: A numeric `legend.position` argument in `theme()` was deprecated in ggplot2
## 3.5.0.
## ℹ Please use the `legend.position.inside` argument of `theme()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
CHT.before.pl
# ggsave(plot = CHT.before.pl, "HW2-CHT-before.png",
# width = 12, height = 8, units = "in")
CHT.after.pl <- ggplot(group.level.after,
aes(x=VOT, y=p, color=ambigPhoneme, group=ambigPhoneme)) +
geom_hline(yintercept=0,colour = "black",linewidth=1) +
geom_line(linewidth=3) +
geom_errorbar(width=0.05, linewidth=3, aes(ymin=ci_low, ymax=ci_high)) +
geom_point(shape=21, fill='white', size = 3) +
ylab('Proportion /t/ choice') +
xlab('VOT (ms)') +
ggtitle('CHT Text-After') +
FormSense_theme() +
scale_fill_manual(values=c("#5e3c99", "#e66101"), aesthetics = c("color", "fill"), name = "Shifted\nPhone")
CHT.after.pl
# ggsave(plot = CHT.after.pl, "HW2-CHT-after.png",
# width = 12, height = 8, units = "in")
We hadn’t covered it yet at the time the assignment was released, but
now that we know how to use facet_wrap we can actually
produce the combined plot in one go!
group.level.all <- CHT.exp2.df %>%
group_by(ambigPhoneme, contextCondition, VOT) %>%
summarize(n = n(),
n_t = sum(t_choice),
p = mean(t_choice),
se = sqrt(p * (1 - p) / n), # standard error
ci_low = p - 1.96 * se, # 95% lower bound
ci_high = p + 1.96 * se # 95% upper bound
)
## `summarise()` has grouped output by 'ambigPhoneme', 'contextCondition'. You can
## override using the `.groups` argument.
CHT.sidebyside.pl <- ggplot(group.level.all,
aes(x=VOT, y=p, color=ambigPhoneme, group=ambigPhoneme)) +
geom_hline(yintercept=0,colour = "black",linewidth=1) +
geom_line(linewidth=3) +
geom_errorbar(width=0.05, linewidth=3, aes(ymin=ci_low, ymax=ci_high)) +
geom_point(pch=21, fill='white', size = 3) +
facet_wrap(~contextCondition,
labeller = labeller(contextCondition = c(
"L" = "Text-Before",
"R" = "Text-After"
))) +
ylab('Proportion /t/ choice') +
xlab('VOT (ms)') +
ggtitle('CHT Main Effect') +
FormSense_theme() +
scale_fill_manual(values=c("#5e3c99", "#e66101"), aesthetics = c("color", "fill"), name = "Shifted\nPhone")
CHT.sidebyside.pl
# ggsave(plot = CHT.sidebyside.pl, "HW2-CHT-sidebyside.png",
#        width = 12, height = 8, units = "in")
Here are the numbers from the classic Pisoni & Tash (1974) paper; first, let’s see how we’d plot them.
#### Pisoni & Tash (1974) plot
PT.extracted.data <- data.frame(
VOT = c(0, 10, 20, 30, 40, 50, 60),
rt = c(500, 494, 544, 580, 478, 481, 482)
)
PT.extracted.data <- PT.extracted.data %>%
mutate(logRT = log(rt)) %>%
mutate(zRT = (logRT - mean(logRT)) / sd(logRT))
pisoni_style_plot <- ggplot(PT.extracted.data, aes(x = VOT, y = rt)) +
geom_line(linetype = "dashed", color = "black", linewidth = 2) +
geom_point(size = 5, color = "black", shape = 24, fill = "black") +
labs(
title = "Replotting Empirical RTs\nfrom Pisoni & Tash (1974)",
x = "Raw VOT",
y = "Mean RT (ms)"
) +
scale_x_continuous(breaks = seq(from = 0,
to = 60,
by = 10)) +
geom_vline(xintercept = 30, color = "steelblue", linetype = "dashed", linewidth = 1) +
single_pane_theme() +
scale_y_continuous(limits = c(450, 600))
pisoni_style_plot
Let’s apply that basic outline here.
CHT.rts.to.plot <- CHT.exp2.df %>%
group_by(VOT) %>%
summarise(meanRT = mean(rt), .groups = "drop")
pl.messy <- ggplot(data = CHT.rts.to.plot, aes(x = VOT, y = meanRT)) +
geom_line(linetype = "dashed", color = "black", linewidth = 2) +
geom_point(size = 5, color = "black", shape = 24, fill = "black") +
labs(title = "Empirical RTs from CHT\nas a function of VOT level",
y = "Raw RT",
x = "Raw VOT",
color = "Experiment / Condition") +
scale_x_continuous(breaks = seq(from = 10,
to = 85,
by = 15)) +
scale_color_viridis_d() +
single_pane_theme() +
theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy
Cool!
This plot is just means, but we really would like some confidence intervals on each point to see visually whether this seems robust given the spread of the distribution.
Let’s try to do that together!
Hint:
- We’re dealing with continuous rather than binomial trials now…
- Thus we can probably model this via a t-distribution
- Recall the qt() function for calculating the quantile function over the t-distribution
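If qt() is rusty, here’s a quick refresher (an illustration only, not the solution): the 97.5% quantile of the t-distribution shrinks toward the familiar normal-approximation value of 1.96 as the degrees of freedom grow.
qt(0.975, df = 10)   # ~2.23: wider intervals when n is small
qt(0.975, df = 1000) # ~1.96: essentially the normal value we used for the binomial CIs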
(In the markdown you’ll see there is a hidden code cell. DO NOT EXPAND it until you first try figuring this out yourself / collaboratively.)
CHT.rts.to.plot <- CHT.exp2.df %>%
group_by(VOT) %>%
summarise(n = n(),
meanRT = mean(rt),
se = sd(rt) / sqrt(n), # standard error
ci_low = meanRT - qt(0.975, df = n - 1) * se, # 95% lower bound
ci_high = meanRT + qt(0.975, df = n - 1) * se # 95% upper bound
, .groups = "drop"
)
pl.messy.withCIs <- ggplot(data = CHT.rts.to.plot, aes(x = VOT, y = meanRT)) +
geom_line(linetype = "dashed", color = "black", linewidth = 2) +
geom_point(size = 5, color = "black", shape = 24, fill = "black") +
geom_errorbar(width=0.05, linewidth=3, aes(ymin=ci_low, ymax=ci_high)) +
labs(title = "Empirical RTs from CHT\nas a function of VOT level",
y = "Raw RT",
x = "Raw VOT",
color = "Experiment / Condition") +
scale_x_continuous(breaks = seq(from = 10,
to = 85,
by = 15)) +
scale_color_viridis_d() +
single_pane_theme() +
theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy.withCIs
Woah! Looks like there’s a huge spread over the RTs specifically at
the VOT=40ms level.
Let’s see what could be going on:
## Seems like the spread is very wide at VOT=40
## Maybe worth exploring more:
# We'll filter to just the trials we want
CHT.exp2.df.VOT40 <- CHT.exp2.df %>%
filter(VOT == 40) %>%
select(c(uniqueid, VOT, rt))
## Let's try plotting the distribution at the trial level:
ggplot(CHT.exp2.df.VOT40, aes(x = rt)) +
geom_density(fill = "skyblue", alpha = 0.5)
Aha, it seems like the culprit must be some really long trial where the participant never hit “next”, or perhaps one that was misrecorded by the experimental software. Either way, we should filter it out to see what the distribution looks like without this outlier:
## First confirming the outlier by checking the distribution
summary(CHT.exp2.df.VOT40$rt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.33 182.00 296.02 465.93 470.00 154615.22
# Distribution looks like we'd expect except for the Max value (even the third quartile isn't so extreme)
CHT.exp2.df.VOT40.nooutlier <- CHT.exp2.df.VOT40 %>% filter(rt < 1000)
summary(CHT.exp2.df.VOT40.nooutlier$rt)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.325 174.025 283.567 319.380 418.226 997.220
ggplot(CHT.exp2.df.VOT40.nooutlier, aes(x = rt)) +
geom_density(fill = "skyblue", alpha = 0.5)
That looks more reasonable. Let’s repeat the above workflow adding CIs
but after removing these outliers.
CHT.rts.no.outliers.to.plot <- CHT.exp2.df %>%
filter(rt < 1000) %>%
group_by(VOT) %>%
summarise(n = n(),
meanRT = mean(rt),
se = sd(rt) / sqrt(n), # standard error
ci_low = meanRT - qt(0.975, df = n - 1) * se, # 95% lower bound
ci_high = meanRT + qt(0.975, df = n - 1) * se # 95% upper bound
, .groups = "drop"
)
pl.messy.withCIs.no.outliers <- ggplot(data = CHT.rts.no.outliers.to.plot, aes(x = VOT, y = meanRT)) +
geom_line(linetype = "dashed", color = "black", linewidth = 2) +
geom_point(size = 5, color = "black", shape = 24, fill = "black") +
geom_errorbar(width=0.05, linewidth=3, aes(ymin=ci_low, ymax=ci_high)) +
labs(title = "Empirical RTs from CHT\nas a function of VOT level",
y = "Raw RT",
x = "Raw VOT",
color = "Experiment / Condition") +
scale_x_continuous(breaks = seq(from = 10,
to = 85,
by = 15)) +
scale_color_viridis_d() +
single_pane_theme() +
theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy.withCIs.no.outliers
The other reason I described these results as “cool” is that the original experiment wasn’t designed with processing speed in mind, nor was processing speed a variable that was manipulated directly. Nevertheless, we see that the trend from Pisoni & Tash (1974) is replicated: participants are slowest to respond when judging input near the category boundary, and faster to respond as the phonetic input moves away from the boundary (in either direction).
Could we test this statistically? Let’s work on this together!
TRY TO WORK OUT TOGETHER HOW TO DO THIS :)
Let’s see what else is going on. If the slowdown is really driven by the category boundary, then we should see the RT peak move around as a result of perceptual learning for the participants whose categorization boundaries shifted.
Does that happen?
First let’s add a column to indicate which subjects had a shifted category boundary (more specifically than just the text-before group: the shift is actually most apparent in the text-before shifted-/d/ participants)
Then we’ll calculate the mean RT per VOT level again.
CHT.all.df.shifted <- CHT.exp2.df %>%
mutate(shift_group = case_when(
contextCondition == "L" & ambigPhoneme == 'd' ~ "CHT right-shift",
TRUE ~ "CHT no-shift"
))
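Before averaging, it’s worth a quick peek at how many trials fall into each group (just a sanity check on the case_when() above):
CHT.all.df.shifted %>% count(shift_group)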
CHT.rts.to.plot <- CHT.all.df.shifted %>%
group_by(VOT, shift_group) %>%
summarise(n = n(),
meanRT = mean(rt),
se = sd(rt) / sqrt(n), # standard error
ci_low = meanRT - qt(0.975, df = n - 1) * se, # 95% lower bound
ci_high = meanRT + qt(0.975, df = n - 1) * se # 95% upper bound
, .groups = "drop"
)
pl.messy.by.group <- ggplot(data = CHT.rts.to.plot, aes(x = VOT, y = meanRT, group = shift_group, color = shift_group)) +
geom_line(linetype = "dashed", linewidth = 2) +
geom_point(size = 5, shape = 24, fill = "white") +
labs(title = "Empirical RTs from CHT",
y = "Raw RT",
x = "Raw VOT",
color = "Experiment / Condition") +
scale_x_continuous(breaks = seq(from = min(CHT.all.df.shifted$VOT),
to = max(CHT.all.df.shifted$VOT),
by = 15)) +
scale_color_viridis_d() +
single_pane_theme() +
theme(legend.position = "right") +
theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.messy.by.group
There’s some overall difference in RTs by group… let’s z-score things so they’re more comparable
# Z-score the log-rts within each condition for comparability
CHT.all.df.shifted <- CHT.all.df.shifted %>% filter(rt > 0) %>%
group_by(shift_group) %>%
mutate(logRT = log(rt)) %>%
mutate(zRT = (logRT - mean(logRT)) / sd(logRT)) %>%
ungroup()
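A quick sanity check on the normalization: within each group, the z-scored values should come out with mean ~0 and standard deviation ~1.
CHT.all.df.shifted %>%
  group_by(shift_group) %>%
  summarise(mean_z = mean(zRT), sd_z = sd(zRT)) # expect ~0 and ~1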
CHT.rts.normalized.to.plot <- CHT.all.df.shifted %>%
group_by(VOT, shift_group) %>%
summarise(n = n(),
meanRT = mean(zRT),
se = sd(zRT) / sqrt(n), # standard error
ci_low = meanRT - qt(0.975, df = n - 1) * se, # 95% lower bound
ci_high = meanRT + qt(0.975, df = n - 1) * se # 95% upper bound
, .groups = "drop"
)
pl.normed.by.group <- ggplot(data = CHT.rts.normalized.to.plot, aes(x = VOT, y = meanRT, group = shift_group, color = shift_group)) +
geom_line(linetype = "dashed", linewidth = 2) +
geom_point(size = 5, shape = 24, fill = "white") +
labs(title = "Empirical RTs from CHT",
y = "z-scored log RT",
x = "Raw VOT",
color = "Experiment / Condition") +
scale_x_continuous(breaks = seq(from = min(CHT.all.df.shifted$VOT),
to = max(CHT.all.df.shifted$VOT),
by = 15)) +
scale_color_viridis_d() +
single_pane_theme() +
theme(legend.position = "bottom") +
theme(strip.text = element_text(size = axisTextSize, face = "bold"))
pl.normed.by.group
Looks like it! (It’s only a boundary shift of about 5–10 ms anyway…) But to really evaluate this difference we’d want to control for more factors… to be continued
In the meantime let’s look at some other data manipulations that will come in handy
It’s rare that a data analysis involves only a single data frame. Typically you have multiple data frames, and you must join them together to answer the questions that you’re interested in.
Let’s work through two kinds of joins:
- Mutating joins, which add new variables to one data frame from matching observations in another.
- Filtering joins, which filter observations from one data frame based on whether or not they match an observation in another.
We’ll first talk about keys, the variables used to connect a pair of data frames in a join.
Then we’ll practice things by looking at the keys in the datasets
from the nycflights13 package, and use that knowledge to
start joining data frames together.
library(tidyverse)
library(nycflights13)
# If you haven't previously you'll need to run:
# install.packages("nycflights13")
Every join involves a pair of keys: a primary key and a foreign key.
A primary key is a variable or set of variables that uniquely identifies each observation.
When more than one variable is needed, the key is called a compound
key. For example, in nycflights13:
airlines records two pieces of data about each airline: its carrier code and its full name. You can identify an airline with its two-letter carrier code, making carrier the primary key.
airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
airports records data about each airport. You can identify each airport by its three-letter airport code, making faa the primary key.
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # ℹ 1,448 more rows
planes records data about each plane. You can identify a plane by its tail number, making tailnum the primary key.
planes
## # A tibble: 3,322 × 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi… AIRBUS INDU… A320… 2 182 NA Turbo…
## # ℹ 3,312 more rows
weather records data about the weather at the origin airports. You can identify each observation by the combination of location and time, making origin and time_hour the compound primary key.
weather
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ℹ 26,105 more rows
## # ℹ 5 more variables: wind_gust <dbl>, precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>
A foreign key is a variable (or set of variables) that corresponds to a primary key in another table. For example:
- flights$tailnum is a foreign key that corresponds to the primary key planes$tailnum.
- flights$carrier is a foreign key that corresponds to the primary key airlines$carrier.
- flights$origin is a foreign key that corresponds to the primary key airports$faa.
- flights$dest is a foreign key that corresponds to the primary key airports$faa.
- flights$origin and flights$time_hour together form a compound foreign key that corresponds to the compound primary key weather$origin and weather$time_hour.
You’ll notice a nice feature in the design of these particular keys:
the primary and foreign keys almost always have the same names, which,
as you’ll see shortly, will make your joining life much easier. It’s
also worth noting the opposite relationship: almost every variable name
used in multiple tables has the same meaning in each place. There’s only
one exception: year means year of departure in
flights and year manufactured in planes. This
will become important when we start actually joining tables
together.
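You can see the mismatch directly (a quick check):
range(flights$year)              # departure year: every flight is from 2013
range(planes$year, na.rm = TRUE) # year of manufacture: spans several decades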
Now that we’ve identified the primary keys in each table, it’s
good practice to verify that they do indeed uniquely identify each
observation. One way to do that is to count() the primary
keys and look for entries where n is greater than one. This
reveals that planes and weather both look
good:
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 × 2
## # ℹ 2 variables: tailnum <chr>, n <int>
weather %>%
count(time_hour, origin) %>%
filter(n > 1)
## # A tibble: 0 × 3
## # ℹ 3 variables: time_hour <dttm>, origin <chr>, n <int>
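What about flights itself? You might guess that time_hour plus carrier plus flight makes a primary key, but rather than assuming, run the same check (a sketch; try it and see whether any duplicates turn up):
flights %>%
  count(time_hour, carrier, flight) %>%
  filter(n > 1)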
Now that we understand how data frames are connected via keys, we can start using joins to better understand the flights dataset. dplyr provides six join functions:
- left_join()
- inner_join()
- right_join()
- full_join()
- semi_join()
- anti_join()
We’ll cover these in turn.
A mutating join allows you to combine variables from two data frames: it first matches observations by their keys, then copies across variables from one data frame to the other. Like mutate(), the join functions add variables to the right, so if your dataset has many variables you won’t see the new ones. For these examples, we’ll make it easier to see what’s going on by creating a narrower dataset with just six variables:
flights.simple <- flights %>%
select(year, time_hour, origin, dest, tailnum, carrier)
flights.simple
## # A tibble: 336,776 × 6
## year time_hour origin dest tailnum carrier
## <int> <dttm> <chr> <chr> <chr> <chr>
## 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA
## 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA
## 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA
## 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6
## 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL
## 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA
## 7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6
## 8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV
## 9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6
## 10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA
## # ℹ 336,766 more rows
There are technically four types of mutating join, but there’s really
just one that you’ll use almost all of the time:
left_join(). It’s special because the output will always
have the same rows as the data frame you’re joining to.
The primary use of left_join() is to add in additional
metadata. For example, we can use left_join() to add the
full airline name to the flights.simple data:
flights.simple %>%
left_join(airlines)
## Joining with `by = join_by(carrier)`
## # A tibble: 336,776 × 7
## year time_hour origin dest tailnum carrier name
## <int> <dttm> <chr> <chr> <chr> <chr> <chr>
## 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA United Air Lines Inc.
## 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA United Air Lines Inc.
## 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA American Airlines Inc.
## 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 JetBlue Airways
## 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Delta Air Lines Inc.
## 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA United Air Lines Inc.
## 7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 JetBlue Airways
## 8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV ExpressJet Airlines I…
## 9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 JetBlue Airways
## 10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA American Airlines Inc.
## # ℹ 336,766 more rows
Or we could find out the temperature and wind speed when each plane departed:
flights.simple %>%
left_join(weather %>% select(origin, time_hour, temp, wind_speed))
## Joining with `by = join_by(time_hour, origin)`
## # A tibble: 336,776 × 8
## year time_hour origin dest tailnum carrier temp wind_speed
## <int> <dttm> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 39.0 12.7
## 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 39.9 15.0
## 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 39.0 15.0
## 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 39.0 15.0
## 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 39.9 16.1
## 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 39.0 12.7
## 7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 37.9 11.5
## 8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV 39.9 16.1
## 9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 37.9 13.8
## 10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA 39.9 16.1
## # ℹ 336,766 more rows
Or what size of plane was flying:
flights.simple %>%
left_join(planes %>% select(tailnum, type, engines, seats))
## Joining with `by = join_by(tailnum)`
## # A tibble: 336,776 × 9
## year time_hour origin dest tailnum carrier type engines seats
## <int> <dttm> <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA Fixed w… 2 149
## 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA Fixed w… 2 149
## 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Fixed w… 2 178
## 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 Fixed w… 2 200
## 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Fixed w… 2 178
## 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Fixed w… 2 191
## 7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 Fixed w… 2 200
## 8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV Fixed w… 2 55
## 9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 Fixed w… 2 200
## 10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA <NA> NA NA
## # ℹ 336,766 more rows
If/when left_join() fails to find a match for a row in x (the data frame you’re joining to), it fills in the new variables with missing values. For example, there’s no information about the plane with tail number N3ALAA, so the type, engines, and seats values will be missing:
flights.simple %>%
filter(tailnum == "N3ALAA") %>%
left_join(planes %>% select(tailnum, type, engines, seats))
## Joining with `by = join_by(tailnum)`
## # A tibble: 63 × 9
## year time_hour origin dest tailnum carrier type engines seats
## <int> <dttm> <chr> <chr> <chr> <chr> <chr> <int> <int>
## 1 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA <NA> NA NA
## 2 2013 2013-01-02 18:00:00 LGA ORD N3ALAA AA <NA> NA NA
## 3 2013 2013-01-03 06:00:00 LGA ORD N3ALAA AA <NA> NA NA
## 4 2013 2013-01-07 19:00:00 LGA ORD N3ALAA AA <NA> NA NA
## 5 2013 2013-01-08 17:00:00 JFK ORD N3ALAA AA <NA> NA NA
## 6 2013 2013-01-16 06:00:00 LGA ORD N3ALAA AA <NA> NA NA
## 7 2013 2013-01-20 18:00:00 LGA ORD N3ALAA AA <NA> NA NA
## 8 2013 2013-01-22 17:00:00 JFK ORD N3ALAA AA <NA> NA NA
## 9 2013 2013-10-11 06:00:00 EWR MIA N3ALAA AA <NA> NA NA
## 10 2013 2013-10-14 08:00:00 JFK BOS N3ALAA AA <NA> NA NA
## # ℹ 53 more rows
(This is something to be careful with.)
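One way to quantify that risk (a sketch) is to count how many rows came back with NA in a joined-in column:
flights.simple %>%
  left_join(planes %>% select(tailnum, type), join_by(tailnum)) %>%
  summarise(n_unmatched = sum(is.na(type))) # rows where no matching plane was found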
By default, left_join() will use all variables that
appear in both data frames as the join key, the so called
natural join. This is a useful heuristic, but it
doesn’t always work. For example, what happens if we try to join
flights.simple with the complete planes
dataset?
flights.simple %>%
left_join(planes)
## Joining with `by = join_by(year, tailnum)`
## # A tibble: 336,776 × 13
## year time_hour origin dest tailnum carrier type manufacturer
## <int> <dttm> <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA <NA> <NA>
## 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA <NA> <NA>
## 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA <NA> <NA>
## 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> <NA>
## 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL <NA> <NA>
## 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA <NA> <NA>
## 7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 <NA> <NA>
## 8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV <NA> <NA>
## 9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 <NA> <NA>
## 10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA <NA> <NA>
## # ℹ 336,766 more rows
## # ℹ 5 more variables: model <chr>, engines <int>, seats <int>, speed <int>,
## # engine <chr>
We get a lot of missing matches because our join is trying to use
tailnum and year as a compound key. Both
flights and planes have a year column but they
mean different things: flights$year is the year the flight
occurred and planes$year is the year the plane was
built.
We only want to join on tailnum so we need to provide an
explicit specification with join_by():
flights.simple %>%
left_join(planes, join_by(tailnum))
## # A tibble: 336,776 × 14
## year.x time_hour origin dest tailnum carrier year.y type
## <int> <dttm> <chr> <chr> <chr> <chr> <int> <chr>
## 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA 1999 Fixed wing mu…
## 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA 1998 Fixed wing mu…
## 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA 1990 Fixed wing mu…
## 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 2012 Fixed wing mu…
## 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL 1991 Fixed wing mu…
## 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA 2012 Fixed wing mu…
## 7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 2000 Fixed wing mu…
## 8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV 1998 Fixed wing mu…
## 9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 2004 Fixed wing mu…
## 10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA NA <NA>
## # ℹ 336,766 more rows
## # ℹ 6 more variables: manufacturer <chr>, model <chr>, engines <int>,
## # seats <int>, speed <int>, engine <chr>
Note: the year variables get disambiguated in the
output with a suffix (year.x and year.y),
which tells you whether the variable came from the x
(joining to) or y (joining from) argument. You can override
the default suffixes with the suffix argument.
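For example (a sketch with made-up suffix names), we could make the origin of each year column explicit:
flights.simple %>%
  left_join(planes, join_by(tailnum), suffix = c("_flight", "_plane"))
# the duplicated column now appears as year_flight and year_plane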
Under the hood, join_by(tailnum) is really shorthand for join_by(tailnum == tailnum). It’s important to know about this fuller form for a couple of reasons:
- First, it describes the relationship between the two tables: the keys must be equal. That’s why this type of join is sometimes called an equi join.
- Second, it’s how you specify different join keys in each table. For example, there are two ways to join the flights.simple and airports tables: either by dest or by origin:
flights.simple %>%
left_join(airports, join_by(dest == faa))
## # A tibble: 336,776 × 13
## year time_hour origin dest tailnum carrier name lat lon
## <int> <dttm> <chr> <chr> <chr> <chr> <chr> <dbl> <dbl>
## 1 2013 2013-01-01 05:00:00 EWR IAH N14228 UA George Bu… 30.0 -95.3
## 2 2013 2013-01-01 05:00:00 LGA IAH N24211 UA George Bu… 30.0 -95.3
## 3 2013 2013-01-01 05:00:00 JFK MIA N619AA AA Miami Intl 25.8 -80.3
## 4 2013 2013-01-01 05:00:00 JFK BQN N804JB B6 <NA> NA NA
## 5 2013 2013-01-01 06:00:00 LGA ATL N668DN DL Hartsfiel… 33.6 -84.4
## 6 2013 2013-01-01 05:00:00 EWR ORD N39463 UA Chicago O… 42.0 -87.9
## 7 2013 2013-01-01 06:00:00 EWR FLL N516JB B6 Fort Laud… 26.1 -80.2
## 8 2013 2013-01-01 06:00:00 LGA IAD N829AS EV Washingto… 38.9 -77.5
## 9 2013 2013-01-01 06:00:00 JFK MCO N593JB B6 Orlando I… 28.4 -81.3
## 10 2013 2013-01-01 06:00:00 LGA ORD N3ALAA AA Chicago O… 42.0 -87.9
## # ℹ 336,766 more rows
## # ℹ 4 more variables: alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>
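And here’s the origin variant (same idea, just keyed on the departure airport; every row should match, since all of these flights depart from EWR, JFK, or LGA):
flights.simple %>%
  left_join(airports, join_by(origin == faa))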
The other three mutating joins (inner_join(), right_join(), full_join()) all have the same interface as left_join(). The difference is which rows they keep: the left join keeps all rows in x, the right join keeps all rows in y, the full join keeps all rows in either x or y, and the inner join keeps only rows that occur in both x and y.
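To get a feel for the difference, compare row counts (a sketch; we haven’t printed the inner-join count here):
flights.simple %>% left_join(planes, join_by(tailnum)) %>% nrow()  # keeps all 336,776 rows
flights.simple %>% inner_join(planes, join_by(tailnum)) %>% nrow() # fewer: flights whose tailnum has no match are dropped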
We’ll come back to these in more detail later.
As you might guess from the name, the primary action of a filtering join is to filter the rows. There are two types: semi-joins and anti-joins.
Semi-joins keep all rows in x that have
a match in y.
For example, we could use a semi-join to filter the
airports dataset to show just the origin airports:
airports %>%
semi_join(flights.simple, join_by(faa == origin))
## # A tibble: 3 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 EWR Newark Liberty Intl 40.7 -74.2 18 -5 A America/New_York
## 2 JFK John F Kennedy Intl 40.6 -73.8 13 -5 A America/New_York
## 3 LGA La Guardia 40.8 -73.9 22 -5 A America/New_York
Or perhaps just the destinations:
airports %>%
semi_join(flights.simple, join_by(faa == dest))
## # A tibble: 101 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 ABQ Albuquerque International Sunport 35.0 -107. 5355 -7 A Ameri…
## 2 ACK Nantucket Mem 41.3 -70.1 48 -5 A Ameri…
## 3 ALB Albany Intl 42.7 -73.8 285 -5 A Ameri…
## 4 ANC Ted Stevens Anchorage Intl 61.2 -150. 152 -9 A Ameri…
## 5 ATL Hartsfield Jackson Atlanta Intl 33.6 -84.4 1026 -5 A Ameri…
## 6 AUS Austin Bergstrom Intl 30.2 -97.7 542 -6 A Ameri…
## 7 AVL Asheville Regional Airport 35.4 -82.5 2165 -5 A Ameri…
## 8 BDL Bradley Intl 41.9 -72.7 173 -5 A Ameri…
## 9 BGR Bangor Intl 44.8 -68.8 192 -5 A Ameri…
## 10 BHM Birmingham Intl 33.6 -86.8 644 -6 A Ameri…
## # ℹ 91 more rows
Anti-joins are the opposite: they return all rows in x that don’t have a match in y. These are useful for finding missing values that are implicit in the data (since implicitly missing values don’t show up as NAs, but instead exist only as an absence).
For example, we can find rows that are missing from
airports by looking for flights that don’t have a matching
destination airport:
flights.simple %>%
anti_join(airports, join_by(dest == faa)) %>%
distinct(dest)
## # A tibble: 4 × 1
## dest
## <chr>
## 1 BQN
## 2 SJU
## 3 STT
## 4 PSE
Plenty more we could talk about, but let’s leave pivoting (and visualizing joins) for next time…
Have a nice weekend!