What was correlation again?

We ended up going a little fast at the end of lecture through an important topic, namely, what the Pearson correlation actually is, so I wanted to walk through it carefully here.


But that first requires us to understand covariance.



Covariance

The sample covariance between two variables \(X\) and \(Y\) is calculated as:

\[ \text{cov}(X,Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1} \]

Where:

  • \(x_i, y_i\) are individual data points
  • \(\bar{x}, \bar{y}\) are the sample means
  • \(n\) is the sample size

(we’re dividing by \(n-1\) as usual as part of Bessel’s Correction).



In prose: the numerator iterates over each of the data points in our sample and, for each one, multiplies its deviation from the mean of \(x\) by its deviation from the mean of \(y\). The denominator then normalizes so that we get the average co-deviation per point in our data set.



We can confirm this with a little intuition-pushing by deriving the covariance manually and confirming that it aligns with the output from the built-in R function cov() (which simply calculates the same thing for us):

# Let's take an actual straight line to start
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)

x.bar <- mean(x)  # 3
y.bar <- mean(y)  # 6

deviations <- (x - x.bar) * (y - y.bar)
# (-2*-4) + (-1*-2) + (0*0) + (1*2) + (2*4)
# = 8 + 2 + 0 + 2 + 8 = 20

numerator <- sum(deviations)
denom <- (length(x) - 1) 

cov_manual <- numerator / denom  # 20/4 = 5

# Verify with cov()
cov(x, y)  # 5
## [1] 5


The \(x\) calculation asks how far each x point is from the sample mean \(\bar{x}\), and the \(y\) calculation asks the same thing for y. The covariance is then just asking whether those deviations change together or not!

  • If, when X increases, Y tends to increase, we get a positive covariance
  • If, when X increases, Y tends to decrease, we get a negative covariance
  • And we get a covariance of zero if there is no (linear) relationship between the variables

# Try out the above by flipping y to be negative....
x <- c(1, 2, 3, 4, 5)
z <- -1 * y

deviations <- (x - mean(x)) * (z - mean(z) )
cov_manual <- sum(deviations) / (length(x) - 1)   # -20 / 4 = -5
cov_manual
## [1] -5
cov(x, z)  # -5
## [1] -5

Units of covariance

Covariance scales with the size of the original input variables. I.e., if we double (or triple, or quadruple…) all the values in x and y here, see how the covariance changes quadratically, growing with the square of the scaling factor…

cov(2*x, 2*y)
## [1] 20
cov(3*x, 3*y)
## [1] 45
cov(4*x, 4*y)
## [1] 80

We could write this as a general form if we wanted:

\[ \text{cov}(aX, bY) = a \cdot b \cdot \text{cov}(X,Y) \]
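We can sanity-check the general form with unequal scaling factors too (re-using the x and y vectors from above, with arbitrary illustrative constants a = 2 and b = 3):

a <- 2
b <- 3
cov(a*x, b*y)        # 30, calculated directly
a * b * cov(x, y)    # 30, via the scaling identity (a * b * 5)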

This scaling property reflects the units that covariance is in: units of \(x\) times units of \(y\)!


So if above the RV \(x\) is measured in kilograms, while \(y\) is in meters, then cov(x,y) is in the unit "kilogram-meters".

This often makes covariance difficult to interpret on its own….

Correlation

This motivates our turn to correlation. Specifically what’s known as Pearson’s product moment correlation or the Pearson correlation coefficient which is defined as follows:

\[ r = \frac{\text{cov}(X,Y)}{\sigma_X \cdot \sigma_Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]

The denominator here (\(\sigma_X \cdot \sigma_Y\)) acts as a normalizing factor that puts a mathematical ceiling on how large the covariance can be relative to the individual variability of X and Y. (…and the \(n-1\) factors hidden in the covariance and in each standard deviation cancel out, which is why they don't appear in the expanded form on the right…)


So while cov() can take any value, the correlation (\(r\)) standardizes it by the maximum possible covariance given the variability in each variable individually.
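To see this concretely, here is a quick check re-using the toy vectors from above: dividing the covariance by the product of the two standard deviations gives exactly what the built-in cor() returns.

# correlation = covariance rescaled by both standard deviations
cov(x, y) / (sd(x) * sd(y))   # 1 (x and y form a perfect straight line)
cor(x, y)                     # 1
cov(x, z) / (sd(x) * sd(z))   # -1 for the flipped version
cor(x, z)                     # -1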


In fact (and this is in Spencer's favorite-statistics-facts-Hall-of-Fame): Variance (\(\sigma^2\)) is actually, mathematically, just the covariance of a variable with itself!

\[ \begin{aligned} \text{var}(X) &= \text{cov}(X,X) \\ &= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})}{n-1} \\ &= \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \end{aligned} \]

THIS is the bridge (to me at least) which explains why we can normalize covariance by dividing by the product of the standard deviations. In doing so we turn covariance into a unitless quantity which ranges between \([-1, 1]\).
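And here is that favorite fact as a one-line check in R:

# variance really is just the covariance of a variable with itself
var(x)       # 2.5
cov(x, x)    # 2.5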


Side note

The reason the (product-moment) correlation coefficient \(r\) has this name: we're multiplying things together (hence "product"), and the things we multiply are a type of what's known as a "moment" on the math side of things. Specifically, measures of central tendency like the mean are the "first moment" while measures of spread like variance or standard deviation are the "second moment".


(There are other types of correlation, e.g., Spearman's rank or Kendall's tau, used for assessing non-linear but monotonic relationships… but that's outside the scope of this class for now…)



To repeat the intuition in prose one last time: variance (or standard deviation) measures how much the values of \(X\) deviate from their own mean (\(\bar{X}\)), while covariance measures whether, when \(X\) deviates from its mean, \(Y\) deviates from its mean (\(\bar{Y}\)) in the same way. If we can avoid repeating the word "variance" too many times and think about the numerator in the cov() function as "co-deviations", then we can understand the correlation coefficient equation as asking: "What proportion of the maximum possible co-deviation are we actually seeing in X and Y?"

What about \(R^2\) then?

Since:

  • \(r\) ranges from \([-1, 1]\),
  • and \(R^2 = r^2\) (literally here…)

Then \(R^2\) ranges between \(0\) and \(1\) and it tells us the proportion of the variation in \(Y\) that is explained by \(X\).


Since \(R^2\), like \(r\), is unitless we can use it to compare models directly (although be careful about trying to optimize for this, as we explain below…)
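As a tiny illustration with our toy vectors from earlier (where the relationship is perfectly linear), squaring the correlation gives the proportion of variation explained:

r <- cor(x, z)   # -1: a perfect negative linear relationship
r^2              # 1: all of the variation in z is "explained" by x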

A practice example

Let’s put this into practice by looking at a fake data set showing the hypothetical relationship between a person’s weight and the maximum distance they can jump… (thanks to Karthik Durvasula for this example)

Jumping ability

#Creating a fake data set with 100 values each for jumps and weights.
set.seed(1001)
N <- 100
Subject <- 1:N

#Getting values
Weight <- 80 + rnorm(n=N, mean=0, sd=15) #The weight of the person.
Jump_m <- -0.05*Weight + 8 + rnorm(n=N, mean=0, sd=1) #The distances (in meters) that they jumped.

#Calculating the distance in feet
Jump_f <- Jump_m*3.28 #The distances (in feet) that they jumped.
Ability <- data.frame(Subject, Weight, Jump_m, Jump_f)


#Plotting the relationship
Ability %>%
  pivot_longer(Jump_m:Jump_f, names_to="Jump", values_to="Distance") %>%
  ggplot(aes(x=Weight, y=Distance))+
  geom_point()+
  geom_smooth(method="lm")+
  facet_grid(~Jump)
## `geom_smooth()` using formula = 'y ~ x'


If we calculate the covariances and correlations, the former change depending on whether the jump distances are in feet or meters (because covariance varies with the scale/units used…), while the latter remain the same.

Question for you: what are the units that covariance is in here?



#covariance
cov(Ability$Weight,Ability$Jump_m)
## [1] -15.53582
cov(Ability$Weight,Ability$Jump_f)
## [1] -50.9575
# correlation
cor(Ability$Weight,Ability$Jump_m)
## [1] -0.748213
cor(Ability$Weight,Ability$Jump_f)
## [1] -0.748213
# In the case of simple linear regression R^2 is just r.... squared....
# Rsq <- cor(Ability$Weight,Ability$Jump_m)^2
# Rsq


Both measures tell us about the direction of the relationship, but correlation gives us more directly interpretable information about the strength of the relationship.

But, remember, both values tell us about how tightly the points cluster around the line which might connect them, not about the slope of that line itself. This is where linear regression comes in.
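A quick illustration of that last point (re-using the toy x vector from the covariance section; y_steep and y_shallow are just made-up names for this example): these two relationships have very different slopes but identical, perfect correlations, so the correlation alone can't tell you how steep the line is.

y_steep   <- 100 * x    # slope of 100
y_shallow <- 0.1 * x    # slope of 0.1
cor(x, y_steep)         # 1
cor(x, y_shallow)       # 1
coef(lm(y_steep ~ x))   # intercept 0, slope 100
coef(lm(y_shallow ~ x)) # intercept 0, slope 0.1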

Try doing this yourself here

Here is a data frame (from Bodo Winter) containing a representative lexicon of English wordforms along with their length (in phones), their \(\log_{10}\) corpus frequency, and their mean RT on a lexical-decision task:

# Read in the lexicon data
ELP.df <- read_csv("https://spencercaplan.org/teaching/F25/Statistics/data/ELP_length_frequency.csv", show_col_types = FALSE)
head(ELP.df, 10)
## # A tibble: 10 × 4
##    Word       Log10Freq length    RT
##    <chr>          <dbl>  <dbl> <dbl>
##  1 zenith         1.34       5  754.
##  2 zephyr         1.70       4  875.
##  3 zeroed         1.30       5  929.
##  4 zeros          1.70       5  625.
##  5 zest           1.54       4  659.
##  6 zigzag         1.36       6  785.
##  7 zigzagging     0.477      8  955.
##  8 zinc           1.63       4  624.
##  9 zing           1.78       3  810.
## 10 zip            2.59       3  587.

Your task is to do the following:

  1. First check the descriptive statistics of each column; intuitively assess whether there is skew in the data
  2. Using ggplot, visualize the distribution of Log10Freq and RT using a violin + boxplot (hint: you probably want to pivot_longer() first)
  • You'll notice this isn't very helpful since the ranges of the two columns are quite different… so next:
  3. Re-do your violin+boxplot from the previous step but as separate facets (use facet_grid)
  • When you do this, we'll see that both distributions have a slight skew (in fact a right skew) – it doesn't look too bad, but maybe this will become important later in our interpretation :)
  4. Assess the relationship between response time (RT) and frequency by checking the correlation (you may use the built-in cor() for this, no need to do it by hand)
  • If done correctly you should find that it's approx. -0.62 – what does this mean in prose?
  • Convert this (by hand this time…) to an \(R^2\) value. What is that and what does it mean?
  5. To go further, fit a linear model between the two variables and print it out
  • How can we interpret the coefficient on the Log10Freq predictor?
  6. Now make a scatterplot of the relationship between frequency and RT and impose the line of best fit (you can do that with geom_smooth(method="lm"))

One more thing here :)

There are some signs in the plot that this might not be the best model to fit. You can see that for the higher frequencies the RTs drift away from the line of best fit, and there is a slight curve to the data, i.e., trying to fit a straight line to the relationship might not be appropriate. In fact, though frequency is what we think is at stake, the rank of the word may actually be more important. Perhaps speakers aren't keeping track of frequency information per se but just a list of words in ranked order which they can search through… you can actually check this out very easily given this data and what we already know!

  7. Add one more column called WordRank, which you can derive by calling rank(Log10Freq) and appending it to the df using mutate. Then check the \(R^2\) between rank and RT, and make a new lm scatterplot which uses WordRank on the x-axis instead of Log10Freq. (You'll see that the \(R^2\) is ever so slightly higher when considering WordRank, at 0.387, compared with Log10Freq, at 0.3834.)

Outlier effects

Up to now we’ve, more or less, assumed that our data meets all the assumptions for applying linear models. But what happens when real data violates these assumptions?

Let's now consider the last data set from lecture (from languageR), this time looking at the relationship between the frequencies of the singular and plural forms of the nouns in question:

plot(ratings$FreqSingular,ratings$FreqPlural)
my.lm.2 <- lm(FreqPlural~FreqSingular,data=ratings)
abline(my.lm.2,col='red',lwd=2)

summary(my.lm.2)
## 
## Call:
## lm(formula = FreqPlural ~ FreqSingular, data = ratings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -153.00  -35.06  -21.66   12.31  277.40 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  31.62758    9.51381   3.324  0.00135 ** 
## FreqSingular  0.60182    0.03579  16.814  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 74.7 on 79 degrees of freedom
## Multiple R-squared:  0.7816, Adjusted R-squared:  0.7788 
## F-statistic: 282.7 on 1 and 79 DF,  p-value: < 2.2e-16

Question

Looking only at the above output from summary(my.lm.2) we can actually tell that the singular forms are more frequent than the corresponding plurals (on average). Why/how is that?


hint: What do the (Intercept) and \(\beta\) for FreqSingular tell us?


You'll notice from the plot that there are four points whose frequency seems notably higher than the rest. What will happen if we remove those from consideration? (I'll filter out the points with a singular frequency greater than 500.)

plot(ratings$FreqSingular,ratings$FreqPlural)
abline(my.lm.2,col='red',lwd=2)
my.lm.3 <- lm(FreqPlural~FreqSingular,data=subset(ratings,FreqSingular <= 500))
abline(my.lm.3,col='blue',lty=2)

# we can quickly confirm that the frequency-based filtering only removed four points by checking the row counts...
nrow(ratings)
## [1] 81
nrow(subset(ratings,FreqSingular <= 500))
## [1] 77

These two lines differ only in four data points, yet the intercept and slope both change pretty substantially… This comparison illustrates a general problem for least-squares linear regression: a few outlier data points can exert an undue influence on the regression line because of the measure (squared error) used to fit it.



What to do about this?

Although many people use outliers to refer to data points that are unusual in both the vertical and horizontal dimensions, sometimes you will hear people use “outlier” to refer only to points that are associated with very large residuals, and influential observation to refer to points that are very far away from most data points in the horizontal dimension. In this case, we have a number of influential observations (those greater than 500), and as the name suggests, they have an undue influence on the regression line.
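If you want to quantify influence rather than just eyeball it, base R can compute Cook's distance for each observation in a fitted model. A quick sketch (the exact values depend on the data, but the handful of high-frequency points should show up with the largest distances):

# Cook's distance: roughly, how much would the fitted line change if we dropped each point?
influence.scores <- cooks.distance(my.lm.2)
head(sort(influence.scores, decreasing = TRUE), 4)   # the most influential observations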


One common approach to the presence of outlier data points is to simply remove them from the data, an approach that is most appropriate if you have independent grounds for believing those data points are not representative of the relationship you are investigating. Another approach is to adopt a transformation to reduce skew; in the present case, because we want to reduce the rightward skew of a few influential observations, taking the log transform of the predictor might help.
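As a hedged sketch of that second approach (assuming FreqSingular holds raw counts; the + 1 just guards against any zero counts), the log transform compresses those few very large values rather than throwing them away:

# Refit using a log-transformed predictor instead of removing the outliers
my.lm.log <- lm(FreqPlural ~ log(FreqSingular + 1), data = ratings)
summary(my.lm.log)

# ...and compare the fit visually on the transformed scale
plot(log(ratings$FreqSingular + 1), ratings$FreqPlural)
abline(my.lm.log, col = 'purple', lwd = 2)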

Multiple groups

In some cases, like the frequency to familiarity relationship, the linear relationship is clear. In other cases, not so much. Let’s consider the following linear model:

bad.lm <- lm(meanSizeRating ~ Frequency, data=ratings)
summary(bad.lm)
## 
## Call:
## lm(formula = meanSizeRating ~ Frequency, data = ratings)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.5110 -0.8628  0.1728  0.7880  1.6347 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.39234    0.38314   3.634 0.000495 ***
## Frequency    0.31842    0.07779   4.093 0.000102 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.8825 on 79 degrees of freedom
## Multiple R-squared:  0.175,  Adjusted R-squared:  0.1645 
## F-statistic: 16.75 on 1 and 79 DF,  p-value: 0.0001019

Question

what does this printout tell us? What about the inferential stats here?


Why do you think I’d describe this as a very poor model despite the small \(p\)-value? (hint: what do you see in the distribution over residuals?)

Let’s see what the data look like when plotted:

ggplot(ratings,
       aes(x = Frequency, y = meanSizeRating)
       ) + 
  geom_point() +
  geom_smooth(method=lm) +
  theme_bw(base_size = 16)
## `geom_smooth()` using formula = 'y ~ x'

Looking further at the residuals directly…

qqnorm(resid(bad.lm))

shapiro.test(resid(bad.lm))
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(bad.lm)
## W = 0.93741, p-value = 0.0006441

Looking at the above Q-Q plot, what jumps out? What violations of our assumptions does this plot suggest? (reminiscent of the Q-Q plot on the midterm, no?)


The plot suggests that there are, in fact, two separate groups in the data that our model doesn’t account for. If this is correct, then our model is mis-specified and we’re violating the independence assumption in our errors. This situation will usually also lead to a violation of the normality assumption.

Splitting by group

We (I) actually know in this case that the top points are animal words, and the bottom points are plant words:

ggplot(ratings,
       aes(x = Frequency, y = meanSizeRating, color=Class)
) + 
  geom_point(size=3) +
  scale_color_manual(values = c("animal" = "dodgerblue", 
                                "plant"  = "red")
  ) + 
  theme_bw(base_size = 16)


So what do we do about this? We might enrich our model by simply adding Class as another predictor. This will create what is known as a multiple linear regression, where we have multiple predictors that combine linearly (additively) to predict our mean size ratings:

good.lm <- lm(meanSizeRating ~ Frequency + Class, data=ratings)
summary(good.lm)
## 
## Call:
## lm(formula = meanSizeRating ~ Frequency + Class, data = ratings)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.18828 -0.20840  0.02022  0.21250  0.89824 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.19113    0.19117  16.692   <2e-16 ***
## Frequency    0.09394    0.03561   2.638   0.0101 *  
## Classplant  -1.68952    0.09061 -18.647   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3802 on 78 degrees of freedom
## Multiple R-squared:  0.8488, Adjusted R-squared:  0.845 
## F-statistic:   219 on 2 and 78 DF,  p-value: < 2.2e-16
# anova(good.lm)


We might write the model here as:

\[ y_i = \alpha + \beta_{Frequency} \cdot \text{Frequency}_i + \beta_{Class} \cdot \text{Class}_i + \epsilon_i \]

One thing you might ask is "How do we multiply \(\beta_{Class}\) by the Class label, which is categorical?"

The answer is that we associate each level of that factor with a number: you should know this already from lecture. It’s the contrast associated with a factor:

contrasts(ratings$Class)
##        plant
## animal     0
## plant      1

And we see that we have a simple treatment (dummy) coding of this factor for the time being.

How do we interpret the coefficients of this model, then?
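Here is one way to see it concretely, plugging in the estimates from the summary() printout above (with treatment coding, animal is the reference level, so the Classplant coefficient just shifts the intercept for plants):

# Two parallel lines, one per class:
#   animals: predicted rating = 3.19 + 0.094 * Frequency
#   plants:  predicted rating = (3.19 - 1.69) + 0.094 * Frequency
coef(good.lm)

# predicted mean size rating at Frequency = 5 for each class
predict(good.lm, newdata = data.frame(Frequency = 5, Class = c("animal", "plant")))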

When we inspect this model, we observe that \(t\)-tests associated with all three parameters are significant. Another way we can look at the status of different predictors is to look at an ANOVA table, which will calculate \(F\)-values to test whether it is useful to include each of our predictors:

anova(good.lm)
## Analysis of Variance Table
## 
## Response: meanSizeRating
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## Frequency  1 13.048  13.048  90.283 1.151e-14 ***
## Class      1 50.253  50.253 347.703 < 2.2e-16 ***
## Residuals 78 11.273   0.145                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

An important note: this ANOVA table represents a sequential ANOVA (sometimes called Type I sums of squares). That is, the \(F\)-tests reported here test the effect of adding each predictor given the ones entered before it: does a model with Frequency fare better than the null model, and then does a model with Class and Frequency fare better than one with just Frequency?


If we entered the predictors into the model in a different order, the tests in the sequential ANOVA would be slightly different (unless the predictors are entirely orthogonal):

## Showing swapping the order of predictors here
rev.good.lm <-  lm(meanSizeRating ~  Class + Frequency, data=ratings)
anova(rev.good.lm)
## Analysis of Variance Table
## 
## Response: meanSizeRating
##           Df Sum Sq Mean Sq  F value  Pr(>F)    
## Class      1 62.295  62.295 431.0250 < 2e-16 ***
## Frequency  1  1.006   1.006   6.9606 0.01006 *  
## Residuals 78 11.273   0.145                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1


We can still visualize our multiple linear regression just as we did our simple linear regression:

ggplot(ratings,
       aes(x = Frequency, y = meanSizeRating, color=Class)
) + 
  geom_point(size=3) +
  geom_smooth(method=lm, se=F, fullrange=T) + 
  scale_color_manual(values = c("animal" = "dodgerblue", 
                                "plant"  = "red")
  ) + 
  theme_bw(base_size = 16)
## `geom_smooth()` using formula = 'y ~ x'

Notice that ggplot here automatically returns two regression lines (one for each group) that have slightly different slopes. There is a suggestion of an interaction of class and frequency: the effect of frequency on size ratings is less pronounced (shallower slope) for plants than it is for animals.
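If you wanted to test that suggestion formally, you would add an interaction term (written with * in R's formula syntax, which expands to both main effects plus their interaction). We won't interpret it here, but as a sketch the fit would look like this:

# Frequency * Class expands to Frequency + Class + Frequency:Class
interaction.lm <- lm(meanSizeRating ~ Frequency * Class, data = ratings)
summary(interaction.lm)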

LOESS

Before worrying about that, though, something is still off about our fits. We are imposing straight lines on data that does not obviously have a linear relationship. We can use a non-parametric smoother, which doesn't make assumptions about the model that generated the data, to diagnose the nature of the relationship between frequency and size ratings.

The most standard way to do this is something called locally estimated scatterplot smoothing (LOESS) – you basically calculate local regressions by sliding a small moving window across the data points.
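Under the hood, geom_smooth(method = loess) is fitting something roughly like the following with stats::loess(), where the span argument controls how wide that moving window is (smaller span = wigglier curve):

# Fit a LOESS smoother directly (span = 0.75 is the default window width)
lo.fit <- loess(meanSizeRating ~ Frequency, data = ratings, span = 0.75)
head(predict(lo.fit))   # the smoothed (fitted) size ratings at the observed frequencies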

ggplot(ratings,
       aes(x = Frequency, y = meanSizeRating, color=Class)
) + 
  geom_point(size=3) +
  geom_smooth(method=loess) + 
  scale_color_manual(values = c("animal" = "dodgerblue", 
                                "plant"  = "red")
  ) + 
  theme_bw(base_size = 16)
## `geom_smooth()` using formula = 'y ~ x'


Indeed, we see that size ratings seem to be roughly a quadratic function of frequency. We can account for this in the model by adding in a quadratic predictor, but this is plenty of stuff for now, so more on this next time….