Picking up from last time

First, let’s wrap up the little project looking at hypothetical data from the processing of syntactic islands HERE.



Detecting an effect

Thinking back to one of our first examples of the semester, suppose you’re a linguist collecting yes/no acceptability judgments about a certain sentence.

You may want to establish a reliable trend in the acceptability judgments; that is, you want to reject the null hypothesis that people are giving yes/no judgments at random. If you ask 5 people, how many have to give the same judgment to reject the null hypothesis?


What about when you ask 6, 8, or 10?


We now have the tools to start answering this kind of question (even before we gather any data!). This is exactly the kind of logic behind a power analysis. Let’s look at this by simulation first before we formalize things.
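Here’s one quick way to see it by simulation, as a minimal sketch: generate many samples of 5 people answering completely at random and see how often each yes/no split comes up.

set.seed(1)
sims <- rbinom(10000, size = 5, prob = 0.5)   # 10,000 simulated samples of 5 random yes/no answers
prop.table(table(sims))                       # how often each number of 'yes' responses occurs under H0

Unanimous samples (0 or 5 yeses) are rare when people really are responding at random, which previews the exact calculation below.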

Starting with the \(n=5\) case and \(H_0: p=0.5\), we can work out the expected numbers of yeses and nos accordingly:

n <- 5                      # number of respondents
k_yes <- 0:n                # every possible number of 'yes' responses
k_no  <- n - k_yes          # ...and the corresponding number of 'no' responses
p_null <- 0.5               # H0: responses are random
E_yes <- n * p_null         # expected 'yes' count under H0
E_no  <- n * (1 - p_null)   # expected 'no' count under H0
chisq <- (k_yes - E_yes)^2 / E_yes + (k_no - E_no)^2 / E_no

pval_chisq <- pchisq(chisq, df = 1, lower.tail = FALSE)   # upper-tail p-value, df = 1

data.frame(k_yes, k_no, chisq, pval_chisq)
##   k_yes k_no chisq pval_chisq
## 1     0    5   5.0 0.02534732
## 2     1    4   1.8 0.17971249
## 3     2    3   0.2 0.65472085
## 4     3    2   0.2 0.65472085
## 5     4    1   1.8 0.17971249
## 6     5    0   5.0 0.02534732


How do we read this? It’s telling us that, with \(n=5\), anything short of a unanimous sample would be insufficient evidence to reject the null that respondents are giving purely random responses.


We know by now, of course, that increasing the sample size makes our estimates better, but we now also have enough information in place to calculate the exact minimum observed sample proportion required to reject the null as a function of \(n\).


When \(df = 1\), the minimum \(\chi^2\) statistic needed to reject \(H_0\) at \(\alpha = 0.05\) is about 3.84:

qchisq(0.95, df = 1)
## [1] 3.841459

So what proportion of aligned responses will generate a \(\chi^2 > 3.84\)? Let’s take a look visually… (The generating code is in the source below, but the resulting figures are the important part.)

alpha <- 0.05
crit <- qchisq(1 - alpha, df = 1)
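To see where the critical proportions below come from, note that for a single yes/no question, plugging \(k_{yes} = n\hat{p}\) and \(k_{no} = n(1-\hat{p})\) into the \(\chi^2\) formula above simplifies things considerably:

\[
\chi^2 = \frac{(n\hat{p} - n/2)^2}{n/2} + \frac{(n(1-\hat{p}) - n/2)^2}{n/2} = 4n\left(\hat{p} - \tfrac{1}{2}\right)^2
\]

So \(\chi^2 > \chi^2_{crit}\) exactly when \(|\hat{p} - 0.5| > \tfrac{1}{2}\sqrt{\chi^2_{crit}/n}\), which is what p_crit_high and p_crit_low compute below.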

For \(n\) between 2 and 50:

N <- 2:50
p_crit_high <- 0.5 + sqrt(crit / N) / 2
p_crit_low  <- 0.5 - sqrt(crit / N) / 2
data <- data.frame(N, p_crit_low, p_crit_high)

ggplot(data, aes(x = N)) +
  geom_ribbon(aes(ymin = p_crit_low, ymax = p_crit_high),
              fill = "gray85", alpha = 0.6) +
  geom_line(aes(y = p_crit_high), color = "red", linewidth = 1.2) +
  geom_line(aes(y = p_crit_low),  color = "red", linewidth = 1.2) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") +
  geom_hline(yintercept = 0, color = "black", linewidth = 0.4) + 
  geom_hline(yintercept = 1, color = "black", linewidth = 0.4) +
  geom_vline(xintercept = 4, color = "blue", linewidth = 0.4) +
  scale_y_continuous(
    breaks = seq(-0.1, 1.1, by = 0.1),
    labels = scales::percent_format(accuracy = 1),
    expand = expansion(mult = c(0.05, 0.05))
  ) +
  labs(
    title = expression(paste("Detectable deviation from chance in a ", chi^2, " (1, n) test -- small n")),
    x = "Number of respondents (N)",
    y = "Percent responding 'yes'"
  ) +
  theme_bw(base_size = 16)

Note that the y-axis is shown here extending above 100% and below 0%: the blue vertical line marks where the critical band crosses those limits, showing that it’s not possible to reject the null at all with \(n < 4\), even with a unanimous sample.


Also note how this scales: even with a 60/40 split we’d still fail to reject the null at 50 respondents. If you wanted to detect a relatively small deviation from 50/50 responding, you’d need to run a reasonably large sample:

N <- 50:2000
p_crit_high <- 0.5 + sqrt(crit / N) / 2
p_crit_low  <- 0.5 - sqrt(crit / N) / 2
data <- data.frame(N, p_crit_low, p_crit_high)

ggplot(data, aes(x = N)) +
  geom_ribbon(aes(ymin = p_crit_low, ymax = p_crit_high),
              fill = "gray85", alpha = 0.6) +
  geom_line(aes(y = p_crit_high), color = "red", linewidth = 1.2) +
  geom_line(aes(y = p_crit_low),  color = "red", linewidth = 1.2) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "black") + 
  scale_y_continuous(
    breaks = seq(0.4, 0.6, by = 0.05),
    labels = scales::percent_format(accuracy = 1),
    expand = expansion(mult = c(0.05, 0.05))
  ) +
  labs(
    title = expression(paste("Detectable deviation from chance in a ", chi^2, " (1, n) test -- large n")),
    x = "Number of respondents (N)",
    y = "Percent responding 'yes'"
  ) +
  theme_bw(base_size = 16)
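To make that concrete, we can invert the same formula to ask how many respondents a given split would require. Here’s a quick sketch using a hypothetical helper, min_n():

# Reject when |p_hat - 0.5| > sqrt(crit / n) / 2, i.e. when n > crit / (4 * (p_hat - 0.5)^2)
min_n <- function(p_hat, crit = qchisq(0.95, df = 1)) crit / (4 * (p_hat - 0.5)^2)
min_n(0.60)   # a 60/40 split needs roughly 96 or more respondents
min_n(0.55)   # a 55/45 split needs roughly 385 or more respondents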

Types of errors

Up to now, we’ve mostly been concerned with falsely rejecting the null when it is in fact true (claiming there’s an effect when there isn’t) – i.e. a Type I error (false positive).


We’ve basically seen two types of null hypothesis tests: tests for the mean (t-tests) and tests for discrepancies between expected and observed counts (\(\chi^2\) tests). In our NHST paradigm, the way these tests work is that we decide an \(\alpha\) level ahead of time, and we reject the null hypothesis if our p-value is below our chosen \(\alpha\).


However, the above examples showing how \(\chi^2\) scales as a function of \(n\) were really getting at the inverse problem – failing to reject a null hypothesis that is in fact false – i.e. a Type II error (false negative).


Here is a classic visual of this, in which the null hypothesis is that the patient under examination is not pregnant:


Just as we used a Greek letter, \(\alpha\), to represent our Type I error rate (probability of a false positive), so shall we use a different letter, \(\beta\), to refer to our Type II error rate (a miss, or failure to detect a true effect). The Type II error rate is related to another value of interest, \(1-\beta\), which is known as the power of the statistical test. The power of a statistical test is the probability that we will reject the null if the null is indeed false. In other words, it’s the probability that we will find an effect if there is one to be found.

We can summarize the logically possible outcomes of our testing set-up in the following table:

|                                 | \(H_0\): True                 | \(H_0\): False                |
|---------------------------------|-------------------------------|-------------------------------|
| Decision: Reject \(H_0\)        | Type I error (\(\alpha\))     | True positive (\(1-\beta\))   |
| Decision: Do not reject \(H_0\) | True negative (\(1-\alpha\))  | Type II error (\(\beta\))     |

Why care? It has practical consequences for those of you designing experiments. You might ask yourself a question: would I go through all the trouble to run an experiment if I only had a 50% chance of observing my effect? The answer is probably ‘no’: you would like a much greater assurance that you could find an effect if there actually is one.

Thus, a good statistical test is one that has a relatively high power to detect an effect. So the natural question is: what gives a test power? How can we as experimenters deal with this?

A visual intuition

First, let’s think about statistical power with some visual aids. The graphs below are modeled on those in Vasishth & Broe (2011), Chapter 4.

Consider a one-sample scenario, where we have collected a data sample. We intend to use a one-sample t-test, and our null hypothesis is \(H_0: \mu = 0\). The following is the sampling distribution of \(\bar{x}\) when the null hypothesis is correct. In this case, the distribution has a mean of zero and a standard deviation equal to the standard error:

This is the sampling distribution of our statistic when \(H_0\) is actually correct, and the true mean is 0. We can see that we will commit a Type I error when our sample mean is sufficiently far away from 0; here, we will commit this error when \(\bar{x}\) is more than ~1.96 standard deviations away from the mean. The shaded areas correspond to observed values of \(\bar{x}\) that would lead us to reject the null hypothesis; this area of the distribution is where we are at risk of committing a Type I error.
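The cutoffs themselves aren’t shown in the text, but here is a minimal sketch of how the rejection boundaries (the lower.qnt and upper.qnt used in the plotting code below) might be computed, assuming the standard error of 2 used in the next example:

standard.error <- 2                                               # assumed, to match the example below
alpha <- 0.05
lower.qnt <- qnorm(alpha / 2, mean = 0, sd = standard.error)      # lower rejection cutoff, ~ -3.92
upper.qnt <- qnorm(1 - alpha / 2, mean = 0, sd = standard.error)  # upper rejection cutoff, ~ +3.92
c(lower.qnt, upper.qnt)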

Let’s add to this plot the distribution of our sample mean when the null hypothesis is not true. Let’s suppose that the true mean of our population was -2 (shown in a color that base R apparently names “cadetblue1”):

cadetblue <- '#8EE5EE'

true.mean <- -2
# (plot.curve(), lower.qnt, upper.qnt, and standard.error are defined in the source for these notes)
plot.curve(-12, lower.qnt, my.sd = standard.error,
           title = "Distribution of sample mean for null (mu = 0) and alternative (mu = -2)")
plot.curve(upper.qnt, 12, add = T, my.sd = standard.error)
plot.curve(lower.qnt, 6, my.mean = -2, my.sd = standard.error, my.col = cadetblue, add = T)

The point here is to visualize the relationship between our \(\alpha\) rate and power. If the null hypothesis were false, and our condition’s true mean were -2 with a standard error of 2, then we would commit a Type II error any time we observed an outcome in the “cadetblue” region. That’s us failing to reject the null in a lot of situations where we should reject it!
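Putting numbers on that picture, as a sketch reusing the cutoffs from above: the Type II error rate is just the probability mass of the alternative distribution that falls between the two cutoffs.

# P(fail to reject H0 | true mean = -2, standard error = 2)
beta  <- pnorm(upper.qnt, mean = -2, sd = 2) - pnorm(lower.qnt, mean = -2, sd = 2)
power <- 1 - beta
c(beta = beta, power = power)   # roughly beta ~ 0.83, power ~ 0.17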


Plotted this way, we can see the relationship between \(\alpha\) and the power of a test: as \(\alpha\) goes down, our power will decrease. Let’s make that same plot, but now assume quantiles for an \(\alpha\) level of 0.01.

Now, the amount of Type II error has increased, and so our power has decreased. By decreasing our Type I error rate (changing \(\alpha\)), we have increased our Type II error rate (\(\beta\)). Alternatively, we can say that we have decreased the power of our test (\(1-\beta\)). Indeed, eye-balling this graph, we might be led to the conclusion that even if the alternative were true, we would have a very small chance of detecting it (of getting a value for the test statistic greater than our critical value).
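As a numeric check (the same sketch as above, just with \(\alpha = 0.01\)), the power really does drop substantially:

lower.01 <- qnorm(0.005, mean = 0, sd = 2)   # ~ -5.15
upper.01 <- qnorm(0.995, mean = 0, sd = 2)   # ~ +5.15
1 - (pnorm(upper.01, mean = -2, sd = 2) - pnorm(lower.01, mean = -2, sd = 2))   # power ~ 0.06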

If the sampling distribution of our test statistic under the null and the alternative were more distant, then our Type II error rate would be lower. Suppose that the true mean of the alternative distribution was -6, rather than -2. Here’s what that would look like:

All other things being equal, the farther our true mean is from the mean under the null hypothesis, the lower our Type II error rate and the higher our power. In other words, the bigger an effect is, the more likely we are to detect it.
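And again numerically, as a sketch assuming the original \(\alpha = 0.05\) cutoffs: moving the true mean from -2 to -6 raises the power considerably.

1 - (pnorm(upper.qnt, mean = -6, sd = 2) - pnorm(lower.qnt, mean = -6, sd = 2))   # power ~ 0.85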


More generally, the less that the sampling distribution of our test statistic under the alternative overlaps with that of the null, the more likely we are to see an effect. Another way in which we can increase the statistical power of the test is to reduce the spread in the sampling distribution of our test statistics… that is, reduce their variance. One way to do this is to increase the number of observations in our sample.

Question

Why is this again? Why does increasing the number of observations in the sample decrease the spread in our sampling distribution?

This demonstration is intended to highlight the two ways in which we can increase the power of our statistical test: 1. reduce the variance of our sampling distribution (collect more data), or 2. increase the effect size (ensure that the mean of your effect is as far from the null as possible).
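Here’s a quick sketch of both levers using base R’s power.t.test(); the specific sample sizes and effect sizes are just illustrative assumptions.

# One-sample t-test power, holding sd = 1: more data and bigger effects both raise power
power.t.test(n = 20, delta = 0.5, sd = 1, type = "one.sample")$power   # ~0.56
power.t.test(n = 80, delta = 0.5, sd = 1, type = "one.sample")$power   # ~0.99
power.t.test(n = 20, delta = 1.0, sd = 1, type = "one.sample")$power   # ~0.99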

Effect sizes for t-tests

It would be good to have a way to talk about effect sizes that is independent of the original scale of measurement (i.e., a standardized effect size). The idea here is that we could have very large effects, in the sense of detectable ones, even if they’re quite small in an absolute sense (and vice versa).

Standardized effect sizes come in all flavors, and they’re tailored to the specific test you’re conducting. For our t-tests of the sample mean, the most widely encountered effect size measures are known as the d family, a set of very closely related statistics promoted by Cohen (1988, Statistical Power Analysis for the Behavioral Sciences).

Two sample

The most common of these is Cohen’s d, which represents the difference between the means of two samples, normalized by a sample standard deviation s. This gives a unit-less measure of effect size expressed in standard deviation units:

\(d = \dfrac{\bar{x}_1 - \bar{x}_2}{s}\)

Cohen’s d originally used the pooled standard deviation from both samples in the denominator, as below:

\(s = \sqrt{\dfrac{(n_1-1)s^2_1 + (n_2-1)s^2_2 }{n_1+n_2-2}}\)

Again, \(s\) here is what is known as the pooled standard deviation (its square is the pooled variance). It is an estimate of the population standard deviation that combines information from both samples. When the two samples are the same size, the pooled variance reduces to the average of the two sample variances; when they are unequal, the larger sample contributes more.
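As a minimal sketch with made-up data (the group means and standard deviations are purely illustrative), computing d with the pooled standard deviation looks like this:

set.seed(42)
x1 <- rnorm(30, mean = 5.0, sd = 2)   # hypothetical group 1
x2 <- rnorm(30, mean = 3.8, sd = 2)   # hypothetical group 2
n1 <- length(x1); n2 <- length(x2)
s_pooled <- sqrt(((n1 - 1) * var(x1) + (n2 - 1) * var(x2)) / (n1 + n2 - 2))
(mean(x1) - mean(x2)) / s_pooled      # Cohen's d, in standard deviation units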

This expression of Cohen’s d is appropriate for two-sample independent t-tests where you assume that both groups share the same variance. If you do not make this assumption, s is calculated in a slightly different fashion:

\(s = \sqrt{\dfrac{s^2_1 + s^2_2}{2}}\)

The general recommendation originating from Cohen (1988), though it’s really a descriptive rule of thumb, is the following guideline for interpreting the resulting values:

  • d = 0.2 is a small effect
  • d = 0.5 is a medium effect
  • d = 0.8 is a large effect

It’s important to emphasize that these are only a rough guide, and the notion of a ‘small’ or ‘large’ effect may vary depending on the theoretical context.

Effect sizes for one-sample and paired t-tests

The effect sizes for one sample and paired t-tests are as follows:

One sample:

\(d = \dfrac{\bar{x}_1 - \mu_0}{s}\)

Where s is simply the sample standard deviation for your one sample.

Paired samples:

\(d = \dfrac{\bar{x}_{difference}}{s_{difference}}\)

For paired-sample t-tests, d is computed as in the one-sample t-test, but over the differences between paired observations.
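A matching sketch for these two cases, again with made-up data:

set.seed(7)
x <- rnorm(25, mean = 0.6, sd = 1.5)       # one hypothetical sample, tested against mu_0 = 0
d_one <- (mean(x) - 0) / sd(x)             # one-sample d

pre   <- rnorm(25, mean = 10, sd = 3)      # hypothetical paired measurements
post  <- pre + rnorm(25, mean = 1, sd = 1)
diffs <- post - pre
d_paired <- mean(diffs) / sd(diffs)        # paired-samples d
c(d_one = d_one, d_paired = d_paired)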

Practice in Colab

Let’s wrap this up by putting it into practice ourselves in another Google Colab notebook.