Factorial designs

So far we’ve mostly looked at one-way experimental designs. However, in many research contexts we’re interested in evaluating the effect of more than one factor on our response. We also talked in the last several lectures about how one should be cautious about adding arbitrary interaction terms to a regression, since doing so can hurt interpretability. These two topics actually end up dovetailing nicely…


A factorial experiment is simply one where our treatments reflect the combination of all possible levels of our experimental factors. A very common example is the \(2\times2\) factorial experiment: an experiment with two factors, each of which has two levels. (Going much beyond this can quickly get unwieldy.)

A representative 2x2 design

Consider the example of a simple acceptability judgment study in Dillon and Hornstein (2013). The authors measured acceptability ratings on a 1-7 rating scale, and they manipulated two features of their stimuli: the presence or absence of a wh-dependency (the WH factor), and the syntactic category of the verb’s argument (NP or small clause complement, the CAT factor). Crossing these two factors led to four unique conditions, our treatments:

  • -WH,SC: Mary heard the sneaky burglar clumsily attempt to open the door.
  • -WH,NP: Mary heard the sneaky burglar’s clumsy attempt to open the door.
  • +WH,SC: What did Mary hear the sneaky burglar clumsily attempt to open?
  • +WH,NP: What did Mary hear the sneaky burglar’s clumsy attempt to open?

Factorial designs are popular, and with good reason: they are very efficient ways of collecting data about the effects of different factors of interest. The reason is that in a factorial design, every observation gives us information about the effect of both of our factors. That is, any judgment offered in this study gives us information about the effect of wh-movement on judgments, as well as the effect of complement category. This efficiency makes such designs well suited to comparing the effects of two orthogonal factors, and it results in more powerful statistical tests.

Let’s look at a sample version of the data first, which contains subject-level means for the conditions of interest. If there were 48 total participants and the design was fully between-subjects, how many subject means per crossed condition will we have? Try to think through this in your head first and then calculate it quickly in R (a one-liner call should do it). When you do so, also calculate the mean rating for each of the four crossed conditions.

islands.data <- read.table('https://spencercaplan.org/teaching/F25/Statistics/data/sample-islands-data.txt',
                           header=T)
head(islands.data, 8)
##   subj gap_site movement    value
## 1    1       NP      -WH 5.955382
## 2    2       NP      -WH 6.153561
## 3    3       NP      -WH 6.333972
## 4    4       NP      -WH 6.369240
## 5    5       NP      -WH 6.568217
## 6    6       NP      -WH 5.532630
## 7    7       NP      -WH 5.879555
## 8    8       NP      -WH 6.024270
# Don't forget to set up the factors correctly at the start!
islands.data$gap_site <- as.factor(islands.data$gap_site)
islands.data$movement <- as.factor(islands.data$movement)

(Try not to peek at the answer in the source below until you’ve attempted this yourself!)
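
Once you have attempted it, here is a quick sketch (using base R one-liners) to check both answers:

# How many subject means per cell? 48 subjects split across a fully between-subjects 2x2 design
with(islands.data, table(gap_site, movement))

# Mean rating for each of the four crossed conditions
aggregate(value ~ gap_site + movement, data = islands.data, FUN = mean)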

Factorial questions

In factorial experiments of this sort, there are a number of questions we might ask. In this instance, we might ask: does the CAT factor have a reliable effect on the responses? How about the WH factor? And are these factors independent, or do they interact?


Let’s consider what it would mean for there to be a general effect of CAT. If we average across the rows of our table, we have the marginal means for each level of the CAT factor:

islands.data %>% group_by(gap_site) %>% dplyr::summarise(rating = mean(value))
## # A tibble: 2 × 2
##   gap_site rating
##   <fct>     <dbl>
## 1 NP         4.55
## 2 SC         5.94

These marginal means suggest that there is probably an effect of CAT: if we collapse across levels of the WH factor, we see a difference of more than a point on our rating scale. We’ll call this the main effect of CAT: from this point of view, it seems that people prefer SC complements to NP complements in our stimuli….


Likewise, averaging across the columns of the table will give us the marginal means for each level of the WH factor:

islands.data %>% group_by(movement) %>% dplyr::summarise(rating = mean(value))
## # A tibble: 2 × 2
##   movement rating
##   <fct>     <dbl>
## 1 -WH        6.36
## 2 +WH        4.13

Woah! An even larger descriptive difference. This looks like an overall 2 point drop due to WH-movement: the main effect of WH. People tend to rate sentences with WH-movement as worse than sentences without WH-movement.


Each of these main effects may be of theoretical interest. For instance, we may be interested in the main effect of WH as an index of the processing difficulty associated with constructing a filler-gap dependency. We may also be interested in the main effect of complement as an index of the preferred syntactic expression of event-denoting complements of perception verbs.

Interactions

However, look again at the full set of condition means you derived at the start (like, actually give it a think). We see that the marginal main effect of WH doesn’t really give us the full picture. If we look at our data, we see that that marginal main effect arises because there is roughly a 1 point drop due to WH-movement in the SC conditions, and a 3 point drop in the NP conditions. In other words, WH and CAT seem to interact: the size of the effect of one factor depends on the level of the other. We will call this the interaction of WH and CAT. And this looks especially damning for the main effect of CAT: there’s almost no difference due to CAT for the non-WH-movement structures.

How to read an interaction

We thought of the main effects as differences in the marginal means. We can think of the interaction as a difference in differences.

This is easiest to see with a quick plot:

# If you're just doing exploratory analysis, sometimes just using the base R plot
# is easier than a full-fledged ggplot2 solution. There's a built-in
# "interaction plot" for exactly this scenario

with(islands.data,
     interaction.plot(movement, gap_site, value)
     )

For a little bit of practice though, quickly try the following (one possible sketch appears right after the list):

  1. Derive the standard errors for each of these means
  2. Use ggplot to visualize this:
  • plot the means as points,
  • add lines (geom_line) to connect matching conditions (as with the simple R plot above)
  • Use your se estimates to add error bars via geom_errorbar
  • Color in the gap_site condition values differently (use any colors you prefer or leave the defaults but I like “red” and “dodgerblue” for a simple contrast like this)
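
If you get stuck, here is one possible sketch (the object name practice.summary is just my choice; no peeking until you’ve given it a go):

# Per-condition means and standard errors first, then the plot
practice.summary <- islands.data %>%
  group_by(gap_site, movement) %>%
  dplyr::summarise(rating = mean(value),
                   se = sd(value)/sqrt(n()))

ggplot(practice.summary,
       aes(x = movement, y = rating, colour = gap_site, group = gap_site)) +
  geom_point(size = 3) +
  geom_line() +                                       # connect matching gap_site conditions
  geom_errorbar(aes(ymin = rating - se, ymax = rating + se), width = 0.1) +
  scale_colour_manual(values = c("red", "dodgerblue")) +
  theme_bw(base_size = 16)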


Leaving aside the ggplot business, above I created an interaction.plot(). This plots the mean for each combination of the factors, with one factor on the x-axis, and the other factor represented as two different lines. This allows us to visualize the effects our two factors might have on our response. Here we see that the +WH conditions are overall lower than the -WH conditions: people rate sentences a little less acceptable when they contain a wh-dependency. We see that this effect is also dependent on the position of the gap site: the negative effect of +WH movement is greater when the sentence contains an NP direct object than when it contains a small clause complement.

Again, this pattern of means suggests that our factors interact; they do not represent independent dimensions of our stimuli. This is, of course, exactly what is predicted by theories that posit grammatical island constraints: there should be a drop in acceptability associated with extraction from an island domain that is significantly larger than any drop in acceptability that occurs under extraction from a non-island domain.

Other (non)-interaction patterns

Possibility 1

But what would the interaction plot have looked like if they were independent factors? Suppose we observed only an effect of WH-movement, but no effect of category (CAT):

hypothetical.data <- read.table('https://spencercaplan.org/teaching/F25/Statistics/data/hypothetical-interactions.txt',
                                header=T)
with(hypothetical.data, 
     interaction.plot(movement, gap_site, value1, col='red', lwd=3)
     )

In this plot, we can see that there is a constant effect of WH-movement that is the same across the levels of CAT, but there is not any real effect of the CAT factor itself. We can check this by looking at the marginal means for each factor:

hypothetical.data %>% group_by(gap_site) %>% dplyr::summarise(rating = mean(value1))
## # A tibble: 2 × 2
##   gap_site rating
##   <chr>     <dbl>
## 1 NP          5.1
## 2 SC          5
hypothetical.data %>% group_by(movement) %>% dplyr::summarise(rating = mean(value1))
## # A tibble: 2 × 2
##   movement rating
##   <chr>     <dbl>
## 1 +WH        6.05
## 2 -WH        4.05

This shows a negligible difference in the marginal means for the CAT factor, but a large difference in the marginal means for the WH factor.

If only WH movement has an effect on our response, something like the following would be a reasonable statistical model of our data:

\[ x_{ijk} = \mu + \alpha_{wh_i} + \epsilon_{ijk} \]

Let \(i\) refer to the current level of the WH factor, \(j\) refer to the current level of the CAT factor, and \(k\) to the index of the observation within a treatment.

So \(\alpha_{wh_i}\) is the effect of being in the ith level of WH, and \(\epsilon_{ijk}\) is the residual associated with an individual observation \(x_{ijk}\). This model asserts that only constant factors and the WH factor contribute predictable variation to our observations. Everything else (the residuals) is just normally distributed noise that our model does not explain.
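
In R, fitting this kind of model to the hypothetical data amounts to a one-predictor regression, e.g. (a sketch):

# Only the WH (movement) factor as a predictor -- Possibility 1
summary(lm(value1 ~ movement, data = hypothetical.data))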



Possibility 2

We might have also seen the opposite pattern: a constant effect of CAT on the ratings, but no effect of WH-movement:

with(hypothetical.data,
     interaction.plot(movement, gap_site, value2, col='red', lwd=3))

And again, in this situation we would only see an effect in the marginal means associated with the CAT factor:

hypothetical.data %>% group_by(gap_site) %>% dplyr::summarise(rating = mean(value2))
## # A tibble: 2 × 2
##   gap_site rating
##   <chr>     <dbl>
## 1 NP            4
## 2 SC            6
hypothetical.data %>% group_by(movement) %>% dplyr::summarise(rating = mean(value2))
## # A tibble: 2 × 2
##   movement rating
##   <chr>     <dbl>
## 1 +WH           5
## 2 -WH           5

If only CAT has an effect on our response, then an appropriate statistical model of our data would be:

\[ x_{ijk} = \mu + \alpha_{cat_j} + \epsilon_{ijk} \]

Where \(\alpha_{cat_j}\) is the effect of being in the j-th level of CAT.

Possibility 3

A third possible outcome would have been that the two factors each exerted a constant effect on the response variable, but each did so independently of the other. That would look like:

with(hypothetical.data,
     interaction.plot(movement, gap_site, value3, col='red', lwd=3)
     )

Independent effects of this sort can be diagnosed in the interaction plot by observing parallel lines. In this case, the marginal means would both suggest a difference, and that difference would be the same within each level of the other factor. Inspection of the marginal means for each factor, and the condition means, supports this view:

hypothetical.data %>% group_by(gap_site) %>% dplyr::summarise(rating = mean(value3))
## # A tibble: 2 × 2
##   gap_site rating
##   <chr>     <dbl>
## 1 NP            3
## 2 SC            5
hypothetical.data %>% group_by(movement) %>% dplyr::summarise(rating = mean(value3))
## # A tibble: 2 × 2
##   movement rating
##   <chr>     <dbl>
## 1 +WH           3
## 2 -WH           5
hypothetical.data %>% group_by(gap_site, movement) %>% dplyr::summarise(rating = mean(value3))
## `summarise()` has grouped output by 'gap_site'. You can override using the
## `.groups` argument.
## # A tibble: 4 × 3
## # Groups:   gap_site [2]
##   gap_site movement rating
##   <chr>    <chr>     <dbl>
## 1 NP       +WH           2
## 2 NP       -WH           4
## 3 SC       +WH           4
## 4 SC       -WH           6

Here, the marginal means suggest that the main effect associated with each factor is 2 points. Looking at the condition means suggests that that 2 point difference holds within each level of the other factor.

In this last set of hypothetical data (and in fact, in the two prior), the effects of WH and CAT are both independent of each other. The visual clue to this is the parallel lines in the interaction plot. The lines in an interaction plot will be parallel if the effects of our two factors are strictly additive. As this term suggests, if WH and CAT were independent of each other, the following would be a reasonable statistical model of the data:

\[ x_{ijk} = \mu + \alpha_{wh_i} + \alpha_{cat_j} + \epsilon_{ijk} \]

That is, we could estimate the effect of WH-movement, we could estimate the effect of CAT, and we could add them together to get a predicted value for any one treatment/condition. Each exerts a predictable / stable influence on the response, and each effect is independent of the other.
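
In R, this additive (no-interaction) model is just a regression with a + between the two factors, e.g. for this third hypothetical dataset (a sketch):

# Main effects of WH and CAT, with no interaction term
summary(lm(value3 ~ movement + gap_site, data = hypothetical.data))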

Crossing effects

It’s worth stopping here to reflect on this point. Consider our judgment of the following island-violating sentence in isolation:

  • *What did Mary hear the sneaky burglar’s clumsy attempt to open?

We know we don’t like it (I hope!). Maybe it’s a 2 on the rating scale, as in the last set of hypothetical data. The traditional explanation of this observation is that there is a specific grammatical constraint that bans extraction from certain syntactic environments. But as the last plot suggests, it is in principle possible to arrive at a low rating (our intuitive observation) by simply adding up penalties accrued by independent factors or constraints. That is, the low level of acceptability we perceive for the island violation could simply reflect a penalty for WH-movement plus a penalty for constructing a complex NP.


This additive pattern is not what is predicted if there is a grammatical island constraint; if there is such a constraint, then the effect of WH-extraction should be larger when extracting from NP environments than from SC environments.

Our actual interaction plot didn’t look like these. Instead, that plot suggested that the size of the WH effect depends on the level of the CAT factor. That is, it seems that our two factors interact. Interactions can be diagnosed in the interaction plot by non-parallel lines. Here’s a hypothetical example:

with(hypothetical.data,
     interaction.plot(movement, gap_site, value4, col='red', lwd=3)
     )

In this graph we have a classic “crossing” interaction: the effect of WH-movement is completely opposite in one level of the CAT factor than in the other level. Our actual data show a non-crossing interaction pattern: the effect of WH-movement is in the same direction for both levels of the CAT factor; it just happens to be much larger for the NP level of the CAT factor:

with(islands.data,
     interaction.plot(movement, gap_site, value, col='purple', lwd=3))

The statistical model we would want for this would include another term to account for the interaction of WH movement and CAT:

\[ x_{ijk} = \mu + \alpha_{wh_i} + \alpha_{cat_j} + \alpha_{whxcat_{ij}} + \epsilon_{ijk} \]

The two-way ANOVA (remember, that’s just a multiple linear regression using categorical predictors) will allow us to evaluate statistical models like this one, consisting of the main effects of each of our experimental factors as well as their interaction.

Why we need the interaction term here

In this kind of scenario it is critical that we do include the interaction term in our regression model. It’s theoretically motivated, it’s part of the original experimental design, and it has a clear interpretation. What’s more, because the factors in question are categorical (as per the experimental design), if we diagnose an interaction in the full model we can choose to split the data and analyze each part separately if we wish to consider the main effects in isolation.


This enterprise is very different from adding arbitrary factors or interactions to a model simply to try and boost \(R^2\).

Modeling with an interaction term

There are three distinct null hypotheses of interest here:

  • \(H_{0:WH}: \alpha_{wh_{+WH}} = \alpha_{wh_{-WH}} = 0\)
  • \(H_{0:CAT}: \alpha_{cat_{NP}} = \alpha_{cat_{SC}} = 0\)
  • \(H_{0:WHxCAT}: \alpha_{whxcat_{NP:+WH}} = \alpha_{whxcat_{NP:-WH}} = \alpha_{whxcat_{SC:+WH}} = \alpha_{whxcat_{SC:-WH}} = 0\)

\(H_{0:WH}\) and \(H_{0:CAT}\) are the null hypotheses that there is no main effect of WH or CAT, respectively. \(H_{0:WHxCAT}\) is the null hypothesis that these two factors do not interact.

An important observation about the main effects of the factors, and their interaction, is that they all capture orthogonal, non-overlapping portions of the variance (if you have an equal number of observations in each cell). Thus, we can actually partition the total sum of squares in a two-way experiment in the following manner:

\[ SS_{total} = SS_{A}+SS_{B}+SS_{AB}+SS_{error} \]

Where A and B are the two factors. If we have an equal number of observations in each group, the sum of squares for each of these terms is calculated in the following manner. Suppose we have \(a\) levels of A, \(b\) levels of B, and \(n\) observations in each treatment (we will also use \(N\) for the total number of observations across all cells):

\[ SS_{TOTAL} = \sum^n_{k=1} \sum^a_{i=1} \sum^b_{j=1} (x_{ijk}-\bar{x})^2 \]

\[ SS_{A} = nb \sum^a_{i=1} (\bar{x}_{i}-\bar{x})^2 \]

\[ SS_{B} = na \sum^b_{j=1} (\bar{x}_{j}-\bar{x})^2 \]

\[ SS_{AB} = n \sum^a_{i=1} \sum^b_{j=1} (\bar{x}_{ij}-\bar{x}_{i}-\bar{x}_{j}+\bar{x})^2 \]

\[ SS_{ERROR} = \sum^n_{k=1} \sum^a_{i=1} \sum^b_{j=1} (x_{ijk}-\bar{x}_{ij})^2 \]


\(\bar{x}_i\) and \(\bar{x}_j\) represent the marginal means in our experiment: the mean for one level of a factor, collapsing across the levels of the other factor. \(\bar{x}_{ij}\) is the condition or treatment mean.

Of all these SS terms, the interaction term \(SS_{AB}\) is the one that probably looks the most unusual. This term is computing the difference between the observed cell mean and the value that would be predicted if there were additive effects of A and B. This difference will be 0 when the factors are strictly additive, and so there will be little in the way of variance to be accounted for in the interaction.
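
To make the partition concrete, here is a quick sketch that computes \(SS_{A}\), \(SS_{B}\), and \(SS_{AB}\) by hand for our (balanced) islands data; the values should match the Sum Sq column of the anova() table further below:

# Factorial sums of squares by hand (assumes a balanced design)
grand.mean <- mean(islands.data$value)
n <- 12; a <- 2; b <- 2                    # 12 observations per cell, two two-level factors

wh.means   <- tapply(islands.data$value, islands.data$movement, mean)   # marginal means (WH)
cat.means  <- tapply(islands.data$value, islands.data$gap_site, mean)   # marginal means (CAT)
cell.means <- tapply(islands.data$value,
                     list(islands.data$movement, islands.data$gap_site), mean)

SS.A  <- n * b * sum((wh.means  - grand.mean)^2)
SS.B  <- n * a * sum((cat.means - grand.mean)^2)
SS.AB <- n * sum((cell.means - outer(wh.means, cat.means, "+") + grand.mean)^2)
c(SS.A = SS.A, SS.B = SS.B, SS.AB = SS.AB)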


As with the one-way ANOVA, we can get the mean squares for each of these terms by dividing by the appropriate degrees of freedom. For the two-way ANOVA, these are:

\(MS_{TOTAL} = SS_{TOTAL}/(N-1)\)

\(MS_{A} = SS_{A}/(a-1)\)

\(MS_{B} = SS_{B}/(b-1)\)

\(MS_{AB} = SS_{AB}/((a-1)(b-1))\)

\(MS_{ERROR} = SS_{ERROR}/(N-ab)\)

From these mean squares, we can construct three \(F\)-tests, one for each of our null hypotheses:

\(F_{A} = MS_{A}/MS_{ERROR}\)

\(F_{B} = MS_{B}/MS_{ERROR}\)

\(F_{AB} = MS_{AB}/MS_{ERROR}\)

Let’s look at this in action with our data:

model <- lm(value ~ movement * gap_site, data=islands.data)
anova(model)
## Analysis of Variance Table
## 
## Response: value
##                   Df Sum Sq Mean Sq F value    Pr(>F)    
## movement           1 59.395  59.395 211.662 < 2.2e-16 ***
## gap_site           1 23.236  23.236  82.806 1.128e-11 ***
## movement:gap_site  1 12.854  12.854  45.809 2.502e-08 ***
## Residuals         44 12.347   0.281                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A few questions to discuss here:

  1. do the degrees of freedom look reasonable? How can we map these values back onto the original design / data?
  2. what may we conclude from these tests as constructed?

Really put your understanding to the test

Check out what happens if, instead of anova(model), you generate summary(model).


You will see that the outcome (particularly the effect of gap_site) looks quite a bit different. Why is this?


Hint: think about how the coding of categorical variables works by default in R, and check whether that’s exactly what we want here. To really confirm things, try changing the coding scheme using e.g. contrasts(islands.data$gap_site) <- contr.sum(2) and see how that affects the results.


Finally, try to impose a different coding scheme (how about Helmert coding?) and see if you can get the output of summary(model) to align with our intuitions about the data.
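
Here’s a sketch of how you might set that up (working on a copy of the data, islands.recode, so the original factors keep their default treatment coding):

# Switch both factors to sum coding (swap in contr.helmert(2) to try Helmert coding)
islands.recode <- islands.data
contrasts(islands.recode$movement) <- contr.sum(2)
contrasts(islands.recode$gap_site) <- contr.sum(2)
summary(lm(value ~ movement * gap_site, data = islands.recode))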

It seems that all three tests were (highly) significant. There is a main effect of both CAT and WH, and an interaction. Before jumping right to the inference, however, we need to exercise a bit of caution. It is difficult to interpret our main effects in the presence of an interaction, especially in the presence of a non-crossing interaction like the one we have. Since the interaction tells us that the effect of WH depends on the level of the CAT factor, we would want to know whether there is a significant effect of WH within each level of the CAT factor before we make anything of our main effect. Again, Tukey’s HSD comes in handy here:

TukeyHSD(aov(model))
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = model)
## 
## $movement
##              diff       lwr       upr p adj
## +WH--WH -2.224757 -2.532945 -1.916569     0
## 
## $gap_site
##           diff      lwr      upr p adj
## SC-NP 1.391525 1.083337 1.699713     0
## 
## $`movement:gap_site`
##                     diff        lwr        upr     p adj
## +WH:NP--WH:NP -3.2597470 -3.8371621 -2.6823319 0.0000000
## -WH:SC--WH:NP  0.3565349 -0.2208802  0.9339500 0.3627474
## +WH:SC--WH:NP -0.8332322 -1.4106473 -0.2558171 0.0020564
## -WH:SC-+WH:NP  3.6162819  3.0388668  4.1936970 0.0000000
## +WH:SC-+WH:NP  2.4265148  1.8490998  3.0039299 0.0000000
## +WH:SC--WH:SC -1.1897671 -1.7671822 -0.6123520 0.0000106

Question: what do these post-hoc comparisons suggest we might conclude about the main effect of WH? What about the main effect of CAT?

Plotting factorial designs

Let’s wrap this section up by doing a little more plotting. You should have already organized the data into a plot-friendly format at the start, but here it is again for reference:

summary.plot.data <- islands.data %>% 
  group_by(gap_site, movement) %>% 
  dplyr::summarise(rating = mean(value),
                   n = n(),
                   se = sd(value)/sqrt(n)
                   )
## `summarise()` has grouped output by 'gap_site'. You can override using the
## `.groups` argument.
summary.plot.data
## # A tibble: 4 × 5
## # Groups:   gap_site [2]
##   gap_site movement rating     n     se
##   <fct>    <fct>     <dbl> <int>  <dbl>
## 1 NP       -WH        6.18    12 0.0916
## 2 NP       +WH        2.92    12 0.223 
## 3 SC       -WH        6.54    12 0.0917
## 4 SC       +WH        5.35    12 0.164
ggplot(summary.plot.data,
       aes(x=gap_site, y=rating, fill=movement)
       ) + 
  ylim(0,7) + 
  geom_bar(position='dodge', stat = "identity") + 
  geom_errorbar(position = position_dodge(width = 0.9),
                aes(ymax = rating+se, ymin=rating-se), width=0.25) + 
  scale_fill_manual(values=c("maroon4", "darkgrey")) + 
  xlab("Gap position") + 
  ylab("Average rating") + 
  ggtitle("Acceptability ratings") +
  theme_bw(base_size = 16)

A more robust and informative way to visualize factorial results these days might be a box-and-whisker plot split by the different factors, with dots showing the individual-level results (here, subject means). See if you can construct that yourself using:

  • geom_boxplot
  • geom_jitter (so the individual points don’t overlap too much to be readable)

(I’ve got a solution commented out below… NO PEEKING until you try it yourself!)
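
If you want something to compare against afterwards, here is one possible sketch (not necessarily the same as the commented-out solution):

ggplot(islands.data, aes(x = gap_site, y = value, fill = movement)) +
  geom_boxplot(outlier.shape = NA) +                              # one box per crossed condition
  geom_jitter(position = position_jitterdodge(jitter.width = 0.15),
              alpha = 0.6, size = 1.5) +                          # individual subject means
  xlab("Gap position") +
  ylab("Rating (subject means)") +
  theme_bw(base_size = 16)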

Quick, practical Logistic Regression

Before we call it a day, let’s walk through how to construct a logistic regression model in R. It’s very simple, and extremely similar to specifying and fitting a linear regression. The function we want is glm(), which stands for generalized linear model, rather than lm().

The other major thing to note (at least for now) is that you should specify the “family” of distribution you are modeling in the glm() call (this is needed to get logistic regression specifically, since the glm framework handles much more than that). Thus a simple call would look like this:


my.logit.model <- glm(binaryOutcome ~ factor1 * factor2 + factor3, data = df, family = binomial)

What was logistic regression again?

With linear regression, we are attempting to build a mathematical model to predict a continuous outcome. But often in psycholinguistics we’re dealing with categorical outcomes (2AFC responses, grammaticality judgements, etc.). And so we’d like to be able to model that as well.


Deep down, this is really just a linear regression, but we pass the weighted sum (the \(\beta\)s times the predictor values, plus the intercept) through a non-linear “activation” function; here that’s the sigmoid:

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

BTW: the name “logistic” in logistic regression refers to the logit function, which is just the inverse of the sigmoid (the natural log of the odds):

\[ \operatorname{logit}(p) = \ln\!\left(\frac{p}{1-p}\right) \]

library(knitr)

df <- data.frame(
  Function = c("Sigmoid", "Logit"),
  Input = c(
    "$-\\infty$ to $\\infty$ (log-odds)",
    "$(0,1)$ (probability)"
  ),
  Output = c(
    "$(0,1)$ (probability)",
    "$-\\infty$ to $\\infty$ (log-odds)"
  ),
  Meaning = c(
    "Converts linear scores to probability",
    "Inverse of sigmoid"
  )
)

kable(df, escape = FALSE, align = "lccc")
Function Input Output Meaning
Sigmoid \(-\infty\) to \(\infty\) (log-odds) \((0,1)\) (probability) Converts linear scores to probability
Logit \((0,1)\) (probability) \(-\infty\) to \(\infty\) (log-odds) Inverse of sigmoid
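
Both of these are built into R already: plogis() is the sigmoid and qlogis() is the logit, so the inverse relationship is easy to sanity check:

plogis(0)            # sigmoid(0) = 0.5
qlogis(0.5)          # logit(0.5) = 0
qlogis(plogis(1.7))  # round trip recovers 1.7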


Here’s some sample data from the mosaicData package. Women were interviewed about whether they were a smoker¹, and their age at the time of the interview was recorded. outcome encodes whether they were alive or dead 20 years later; we’ll encode “Alive” as 0 and “Dead” as 1.

# install.packages("mosaicData")
library(mosaicData)
glimpse(Whickham)
## Rows: 1,314
## Columns: 3
## $ outcome <fct> Alive, Alive, Dead, Alive, Alive, Alive, Alive, Dead, Alive, A…
## $ smoker  <fct> Yes, Yes, Yes, No, No, Yes, Yes, No, No, No, No, Yes, No, Yes,…
## $ age     <int> 23, 18, 71, 67, 64, 38, 45, 76, 28, 27, 28, 34, 20, 72, 48, 45…

Why not just linear regression?

Can’t we just use linear regression here? (You might be wondering.)

Whickham$outcome_num <- as.numeric(Whickham$outcome == "Dead")

smoking.plot.linear <- ggplot(Whickham, aes(x = age, y = outcome_num, color = smoker)) + 
  geom_jitter(width = 0.8, height = 0.3, alpha = 0.7, size = 2) +
  scale_color_manual(values = c("dodgerblue", "red")) +
  geom_smooth(
    method = "lm",
    se = FALSE,
    linetype = "dashed",
    color = "black",
    linewidth = 1
  ) +
  labs(
    x = "Age",
    y = "Outcome (Alive / Dead)",
    color = "Smoker"
  ) +
  theme_minimal(base_size = 12)
smoking.plot.linear
## `geom_smooth()` using formula = 'y ~ x'

smoking.plot.logistic <- ggplot(Whickham, aes(x = age, y = outcome_num, color = smoker)) + 
  geom_jitter(width = 0.8, height = 0.3, alpha = 0.7, size = 2) +
  scale_color_manual(values = c("dodgerblue", "red")) +
  geom_smooth(
    method = "glm",
    method.args = list(family = binomial),
    se = FALSE,
    linewidth = 1.2
  ) +
  labs(
    x = "Age",
    y = "Outcome (Alive / Dead)",
    color = "Smoker"
  ) +
  theme_minimal(base_size = 12)

smoking.plot.logistic
## `geom_smooth()` using formula = 'y ~ x'

We could in a mathematical sense, but is that justified here?

  • Our predicted linear trend line will go below zero or above one depending on age… that doesn’t seem right?
  • And it will predict intermediate values like 0.5 for “aliveness”… that doesn’t seem right either?
  • Rather, we want probabilities over the binary outcome of alive vs. dead

Model specification

logistic.model <- glm(outcome ~ age + smoker, data = Whickham, family = "binomial")
logistic.model
## 
## Call:  glm(formula = outcome ~ age + smoker, family = "binomial", data = Whickham)
## 
## Coefficients:
## (Intercept)          age    smokerYes  
##     -7.5992       0.1237       0.2047  
## 
## Degrees of Freedom: 1313 Total (i.e. Null);  1311 Residual
## Null Deviance:       1560 
## Residual Deviance: 945   AIC: 951

Understanding the coefficients

How do we interpret these?

coef_df <- data.frame(
  term = names(coef(logistic.model)),
  estimate = coef(logistic.model)
)

coef_df
##                    term   estimate
## (Intercept) (Intercept) -7.5992213
## age                 age  0.1236832
## smokerYes     smokerYes  0.2046991

These don’t give us a classification directly, nor do they map onto our logistic regression curve in an obvious way. So what’s going on?

When we specify our formula within the glm() call, it’s really this:

\[ y = w_0 + w_1x_1 + w_2x_2 \]

Where \(x_1\) is age and \(x_2\) is smoker. Then, applying the non-linear transformation (which you can do by hand in R via plogis() if you want), we get:

\[ p = \dfrac{1}{1 + e^{-( w_0 + w_1x_1 + w_2x_2)}} \]

And the call to glm() is giving us those weights.
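
For example, here’s a quick sanity check for a hypothetical 50-year-old smoker (the age value is arbitrary): build the prediction by hand with plogis() and compare it to predict():

w <- coef(logistic.model)
eta <- w["(Intercept)"] + w["age"] * 50 + w["smokerYes"] * 1   # linear predictor (log-odds)
plogis(eta)                                                    # push it through the sigmoid
predict(logistic.model, newdata = data.frame(age = 50, smoker = "Yes"),
        type = "response")                                     # same value via predict()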


How do we interpret those weights then? Well, since they are in log-odds space, it’ll be easier if we exponentiate them back into odds space:

exp(coef(logistic.model))
##  (Intercept)          age    smokerYes 
## 0.0005008413 1.1316572762 1.2271557224

This means that each additional year multiplies the odds of a person dying by about 1.132. And similarly, being a smoker in the study means that one had 1.23 times the odds of dying (during the study period) compared to non-smokers.


Calling summary on logistic regressions gives us the same kind of information as it does for linear models:

summary(logistic.model)
## 
## Call:
## glm(formula = outcome ~ age + smoker, family = "binomial", data = Whickham)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -7.599221   0.441231 -17.223   <2e-16 ***
## age          0.123683   0.007177  17.233   <2e-16 ***
## smokerYes    0.204699   0.168422   1.215    0.224    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1560.32  on 1313  degrees of freedom
## Residual deviance:  945.02  on 1311  degrees of freedom
## AIC: 951.02
## 
## Number of Fisher Scoring iterations: 6

The z-values come from the idea that our estimates of the true \(\beta\)s here are approximately normally distributed, and thus the statistic can be derived as we’ve done before:

\[ z_j = \dfrac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \]

And the \(p\)-value is testing the null hypothesis that \(\beta_j = 0\). So, as with the rest of our modeling enterprise, be careful when interpreting the (very tempting looking) p-values you see in a regression table. They do not directly reflect the hypothesis of interest!


Instead, we’ll soon set up a likelihood-ratio test between nested models: one with the predictor of interest and one without. (Spoiler alert: if you set up that nested model comparison using the full model we defined above, you’ll find that including smoker does not significantly improve model fit…)
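
As a preview, that comparison will look something like this (a sketch; the object name reduced.model is just my choice):

# Nested model comparison: does adding smoker improve fit over age alone?
reduced.model <- glm(outcome ~ age, data = Whickham, family = "binomial")
anova(reduced.model, logistic.model, test = "Chisq")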

Thus, here’s an open-ended challenge for you: try to figure out why smoking doesn’t appear to be a significant predictor of mortality here! (Since, unlike Fisher, we know from a variety of evidence that it is a predictor of mortality.)


Hint: see if there is any correlation between age and smoking in the dataset. Also consider whether there’s an interaction between age and smoking. How about the relative scale of the two predictors?


  1. Although, ironically and bizarrely, Fisher (yes, that Fisher) was an anti-anti-smoking advocate.