We ended up going a little fast at the end of lecture through an important topic, namely what the Pearson correlation actually is, so I wanted to walk through it carefully here.
But that first requires us to understand covariance.
The sample covariance between two variables \(X\) and \(Y\) is calculated as:
\[
\text{cov}(X,Y) = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{n-1}
\]
where \(x_i\) and \(y_i\) are the individual observations, \(\bar{x}\) and \(\bar{y}\) are the sample means, and \(n\) is the sample size (we’re dividing by \(n-1\) as usual as part of Bessel’s Correction).
In prose: the numerator iterates over each data point in our sample and multiplies that point’s deviation from the sample mean of \(x\) by its deviation from the sample mean of \(y\). The denominator then normalizes so that we get the average co-deviation per point in our data set.
We can confirm this with a little intuition pushing by deriving the covariance manually, and confirming this aligns with the output from the built-in R function cov() (which simply calculates the same thing for us):
# Let's take an actual straight line to start
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 6, 8, 10)
x.bar <- mean(x) # 3
y.bar <- mean(y) # 6
deviations <- (x - x.bar) * (y - y.bar)
# (-2*-4) + (-1*-2) + (0*0) + (1*2) + (2*4)
# = 8 + 2 + 0 + 2 + 8 = 20
numerator <- sum(deviations)
denom <- (length(x) - 1)
cov_manual <- numerator / denom # 20/4 = 5
# Verify with cov()
cov(x, y) # 5
## [1] 5
The \(x\) calculation is asking how far each x point is from the sample mean \(\bar{x}\), and the \(y\) calculation is asking the same thing but for y. And then the covariance is just asking whether those deviations change together or not!
# Try out the above by flipping y to be negative....
x <- c(1, 2, 3, 4, 5)
z <- -1 * y
deviations <- (x - mean(x)) * (z - mean(z) )
cov_manual <- sum(deviations) / (length(x) - 1) # -20 / 4 = -5
cov_manual
## [1] -5
cov(x, z) # -5
## [1] -5
Covariance scales with the size of the original input variables. I.e., if we scale up all the values in x and y here (doubling, tripling, quadrupling them), see how the covariance grows quadratically (by a factor of 4, 9, 16)…
cov(2*x, 2*y)
## [1] 20
cov(3*x, 3*y)
## [1] 45
cov(4*x, 4*y)
## [1] 80
We could write this as a general form if we wanted:
\[ \text{cov}(aX, bY) = a \cdot b \cdot \text{cov}(X,Y) \]
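If you want to sanity-check that general form, here’s a quick sketch using the same toy x and y vectors from above (the scale factors a and b are just arbitrary numbers I’m picking for illustration):
a <- 2
b <- 3
cov(a*x, b*y)       # a * b * cov(x,y) = 2 * 3 * 5 = 30
a * b * cov(x, y)   # also 30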
This scaling property reflects the units that covariance is in: units of \(x\) times units of \(y\)!
So if above the RV \(x\) is measured in kilograms, while \(y\) is in meters, then cov(x,y) is in the unit: “kilogram-meters”.
This often makes covariance difficult to interpret on its own…
This motivates our turn to correlation: specifically, what’s known as Pearson’s product-moment correlation (or the Pearson correlation coefficient), which is defined as follows:
\[ r = \frac{\text{cov}(X,Y)}{\sigma_X \cdot \sigma_Y} = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2} \cdot \sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}} \]
The denominator here (\(\sigma_X \cdot \sigma_Y\)) acts as a normalizing factor that puts a mathematical ceiling on how large the covariance can be relative to the individual variability of X and Y. (…and the \(n-1\) in the covariance cancels with the \(n-1\)s inside the two standard deviations, which is why it doesn’t appear in the right-hand expression…)
So while cov() can take any value, the correlation
(\(r\)) standardizes it by the maximum
possible covariance given the variability in each variable
individually.
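Just like we did for covariance, we can derive \(r\) by hand for the toy vectors from above and confirm it matches R’s built-in cor() (a quick sketch; r.num and r.den are just names I’m making up here):
# numerator: the sum of co-deviations (same as the covariance numerator)
r.num <- sum((x - mean(x)) * (y - mean(y)))   # 20
# denominator: the "ceiling" set by each variable's own spread
r.den <- sqrt(sum((x - mean(x))^2)) * sqrt(sum((y - mean(y))^2))   # sqrt(10) * sqrt(40) = 20
r.num / r.den   # 1 -- a perfect straight line
cor(x, y)       # also 1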
In fact (and this is in Spencer’s favorite-statistics-facts Hall of Fame): variance (\(\sigma^2\)) is actually, mathematically, just the covariance of a variable with itself!
\[ \begin{aligned} \text{var}(X) &= \text{cov}(X,X) \\ &= \frac{\sum_{i=1}^{n}(x_i - \bar{x})(x_i - \bar{x})}{n-1} \\ &= \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} \end{aligned} \]
THIS is the bridge (to me at least) which explains why we can normalize covariance by dividing by the standard deviations. In doing so we normalize covariance into a unitless quantity which ranges between \([-1, 1]\).
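And we can confirm that Hall-of-Fame fact numerically with the x vector from our running example:
var(x)      # 2.5
cov(x, x)   # also 2.5 -- the variance really is the covariance of x with itself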
Side note
The reason the (product-moment) correlation coefficient \(r\) has this name: we’re multiplying things together (hence “product”) and the things we multiply are a type of what’s known as a “moment” on the math side of things. Specifically, measures of central tendency like the mean are the “first moment,” while measures of spread like the variance are the “second (central) moment.”
(There are other types of correlation, e.g., Spearman’s rank or Kendall’s tau, used for assessing non-linear monotonic relationships… but that’s outside the scope of this class for now.)
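(For the curious: R’s cor() function can compute those other correlations too, via its method argument; a quick sketch, definitely not required for this class:)
cor(x, y, method = "spearman")   # Spearman's rank correlation (1 here: the toy data are perfectly monotonic)
cor(x, y, method = "kendall")    # Kendall's tau (also 1 here)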
To repeat the intuition in prose one last time: variance (or standard deviation) measures how much the values of \(X\) deviate from their own mean (\(\bar{X}\)), while covariance measures whether, when \(X\) deviates from its mean, \(Y\) deviates from its mean (\(\bar{Y}\)) in the same way. If we can avoid repeating the word “variance” too many times and think about the numerator in the cov() function as “co-deviations,” then we can understand the correlation coefficient equation as asking: “What proportion of the maximum possible co-deviation are we actually seeing in X and Y?”
Since \(r\) ranges between \(-1\) and \(1\), and (in simple linear regression) \(R^2\) is just \(r\) squared, \(R^2\) ranges between \(0\) and \(1\), and it tells us the proportion of the variation in \(Y\) that is explained by \(X\).
Since \(R^2\), like \(r\), is unitless, we can use it to compare models directly (although be careful about trying to optimize for this, as we explain below…).
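To see the \(R^2 = r^2\) relationship for simple linear regression in action, here’s a small sketch with made-up noisy data (x2 and y2 are just hypothetical toy vectors, so the answer isn’t trivially 1):
set.seed(1)
x2 <- 1:20
y2 <- 2*x2 + rnorm(n = 20, mean = 0, sd = 3)
cor(x2, y2)^2                    # r, squared...
summary(lm(y2 ~ x2))$r.squared   # ...matches the R^2 reported by the fitted model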
Let’s put this into practice by looking at a fake data set showing the hypothetical relationship between a person’s weight and the maximum distance they can jump… (thanks to Karthik Durvasula for this example)
#Creating a fake data set with 100 values each for jumps and weights.
set.seed(1001)
N <- 100
Subject <- 1:N
#Getting values
Weight <- 80 + rnorm(n=N, mean=0, sd=15) #The weight of the person.
Jump_m <- -0.05*Weight + 8 + rnorm(n=N, mean=0, sd=1) #The distances (in meters) that they jumped.
#Calculating the distance in feet
Jump_f <- Jump_m*3.28 #The distances (in feet) that they jumped.
Ability <- data.frame(Subject, Weight, Jump_m, Jump_f)
#Plotting the relationship
Ability %>%
pivot_longer(Jump_m:Jump_f, names_to="Jump", values_to="Distance") %>%
ggplot(aes(x=Weight, y=Distance))+
geom_point()+
geom_smooth(method="lm")+
facet_grid(~Jump)
## `geom_smooth()` using formula = 'y ~ x'
If we calculate the covariances and correlations, the former change depending on whether the jump distances are in feet or meters (because they vary with the scale/measure used…), while the latter remain the same.
Question for you: what are the units that covariance is in here?
#covariance
cov(Ability$Weight,Ability$Jump_m)
## [1] -15.53582
cov(Ability$Weight,Ability$Jump_f)
## [1] -50.9575
# correlation
cor(Ability$Weight,Ability$Jump_m)
## [1] -0.748213
cor(Ability$Weight,Ability$Jump_f)
## [1] -0.748213
# In the case of simple linear regression R^2 is just r.... squared....
# Rsq <- cor(Ability$Weight,Ability$Jump_m)^2
# Rsq
Both measures tell us about the direction of the relationship, but correlation gives us more direct and interpretable information about the strength of the relationship.
But, remember, both of these values are telling us about how far the points are from the line which might connect them, not about the slope of that line itself. This is where linear regression comes in.
Here is a data frame (from Bodo Winter) containing a representative lexicon of English wordforms along with their length (in phones), their \(log_{10}\) corpus frequency, and their mean RT on a lexical-decision task:
#
ELP.df <- read_csv("https://spencercaplan.org/teaching/F25/Statistics/data/ELP_length_frequency.csv", show_col_types = FALSE)
head(ELP.df, 10)
## # A tibble: 10 × 4
## Word Log10Freq length RT
## <chr> <dbl> <dbl> <dbl>
## 1 zenith 1.34 5 754.
## 2 zephyr 1.70 4 875.
## 3 zeroed 1.30 5 929.
## 4 zeros 1.70 5 625.
## 5 zest 1.54 4 659.
## 6 zigzag 1.36 6 785.
## 7 zigzagging 0.477 8 955.
## 8 zinc 1.63 4 624.
## 9 zing 1.78 3 810.
## 10 zip 2.59 3 587.
Your task is to do the following:
1. Use ggplot to visualize the distribution of Log10Freq and RT using a violin + boxplot (hint: you probably want to pivot_longer() first, and then facet_grid)
2. Calculate the correlation between Log10Freq and RT (you can use cor() for this, no need to do it by-hand). You should get roughly -0.62 – what does this mean in prose?
3. What is the \(R^2\) for a simple linear model of RT with the Log10Freq predictor?
4. Make a scatterplot of RT against Log10Freq with the fitted regression line overlaid (geom_smooth(method="lm"))

One more thing here :)
There are some signs in the plot that this might not be the best model to fit. You can see that for the higher frequencies, the RTs are not in line with the line of best fit, and there is a slight curve, i.e., trying to fit a straight line on the relationship might not be appropriate. In fact, though frequency is what we think is at stake, the rank of the word may actually be more important. Perhaps speakers aren’t keeping track of frequency information per se but just a list of words in ranked order which they can search through… you can actually check this out very easily given this data and what we already know!
Create a new column WordRank, which you can derive by calling rank(Log10Freq) and appending it to the df using mutate. Then check the \(R^2\) between rank and RT, and make a new lm scatterplot which uses WordRank on the x-axis instead of Log10Freq. (You’ll see that the \(R^2\) is ever so slightly higher when considering WordRank, at 0.387, compared with Log10Freq, at 0.3834.)

Up to now we’ve, more or less, assumed that our data meet all the assumptions for applying linear models. But what happens when real data violates those assumptions?
Let’s now consider the last data set from lecture (from languageR), but this time looking at the relationship between the frequencies of the singular and plural forms of the nouns in question:
plot(ratings$FreqSingular,ratings$FreqPlural)
my.lm.2 <- lm(FreqPlural~FreqSingular,data=ratings)
abline(my.lm.2,col='red',lwd=2)
summary(my.lm.2)
##
## Call:
## lm(formula = FreqPlural ~ FreqSingular, data = ratings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -153.00 -35.06 -21.66 12.31 277.40
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 31.62758 9.51381 3.324 0.00135 **
## FreqSingular 0.60182 0.03579 16.814 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 74.7 on 79 degrees of freedom
## Multiple R-squared: 0.7816, Adjusted R-squared: 0.7788
## F-statistic: 282.7 on 1 and 79 DF, p-value: < 2.2e-16
Question
Looking only at the above output from
summary(my.lm.2) we can actually tell that the singular
forms are more frequent than the corresponding plurals (on average).
Why/how is that?
hint: What do the (Intercept) and
\(\beta\) for FreqSingular
tell us?
You’ll notice from the plot that there are four points whose frequency seems notably higher than the rest. What will happen if we remove those from consideration? (I’ll filter out items with a singular frequency greater than 500.)
plot(ratings$FreqSingular,ratings$FreqPlural)
abline(my.lm.2,col='red',lwd=2)
my.lm.3 <- lm(FreqPlural~FreqSingular,data=subset(ratings,FreqSingular <= 500))
abline(my.lm.3,col='blue',lty=2)
# we can quickly confirm that the frequency-based filtering did only remove four points by checking the lengths...
nrow(ratings)
## [1] 81
nrow(subset(ratings,FreqSingular <= 500))
## [1] 77
These two lines differ only in four data points. Yet the intercept and slope both change pretty substantially… This comparison illustrates a general problem for least squares linear regression: the presence of a few outlier data points can exercise an undue influence on the regression line due to the measure (squared error) used to fit the line.
Although many people use outliers to refer to data points that are
unusual in both the vertical and horizontal dimensions, sometimes you
will hear people use “outlier” to refer only to points that are
associated with very large residuals, and influential
observation to refer to points that are very far away from most
data points in the horizontal dimension. In this case, we have a number
of influential observations (those greater than 500), and
as the name suggests, they have an undue influence on the regression
line.
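If you’d rather not just eyeball which points are influential, base R provides influence measures for any lm fit; here’s a quick sketch using Cook’s distance (one common choice; cd is just a placeholder name):
# Cook's distance: roughly, how much the fitted line would move if we dropped each point
cd <- cooks.distance(my.lm.2)
head(sort(cd, decreasing = TRUE), 4)   # the four most influential points (likely our four high-frequency items)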
One common approach to the presence of outlier data points is to simply remove them from the data, an approach that is most appropriate if you have independent grounds for believing those data points are not representative of the relationship you are investigating. Another approach is to adopt a transformation to reduce skew; in the present case, because we want to reduce the rightward skew of a few influential observations, taking the log transform of the predictor might help.
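As a sketch of what that transformation route might look like here (I’m using log1p(), i.e. log(1 + x), just as a guard in case any frequency counts happen to be zero; plain log() would do if they’re all positive, and my.lm.log is just a placeholder name):
my.lm.log <- lm(FreqPlural ~ log1p(FreqSingular), data = ratings)
summary(my.lm.log)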
In some cases, like the frequency to familiarity relationship, the linear relationship is clear. In other cases, not so much. Let’s consider the following linear model:
bad.lm <- lm(meanSizeRating ~ Frequency, data=ratings)
summary(bad.lm)
##
## Call:
## lm(formula = meanSizeRating ~ Frequency, data = ratings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.5110 -0.8628 0.1728 0.7880 1.6347
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.39234 0.38314 3.634 0.000495 ***
## Frequency 0.31842 0.07779 4.093 0.000102 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8825 on 79 degrees of freedom
## Multiple R-squared: 0.175, Adjusted R-squared: 0.1645
## F-statistic: 16.75 on 1 and 79 DF, p-value: 0.0001019
Question
what does this printout tell us? What about the inferential stats here?
Why do you think I’d describe this as a very poor model despite
the small \(p\)-value?
(hint: what do you see in the distribution over
residuals?)
Let’s see what the data look like when plotted:
ggplot(ratings,
aes(x = Frequency, y = meanSizeRating)
) +
geom_point() +
geom_smooth(method=lm) +
theme_bw(base_size = 16)
## `geom_smooth()` using formula = 'y ~ x'
Looking further at the residuals directly…
qqnorm(resid(bad.lm))
shapiro.test(resid(bad.lm))
##
## Shapiro-Wilk normality test
##
## data: resid(bad.lm)
## W = 0.93741, p-value = 0.0006441
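As an aside: base R will also produce a whole panel of diagnostic plots for any lm fit, beyond the single Q-Q plot we made with qqnorm() above; a sketch:
par(mfrow = c(2, 2))   # arrange the four diagnostic plots in a 2x2 grid
plot(bad.lm)           # residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
par(mfrow = c(1, 1))   # reset the plotting layout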
Looking at the above Q-Q plot, what jumps out? What violations of our assumptions does this plot suggest? (reminiscent of the Q-Q plot on the midterm, no?)
The plot suggests that there are, in fact, two separate groups in the data that our model doesn’t account for. If this is correct, then our model is mis-specified and we’re violating the independence assumption in our errors. This situation will usually also lead to a violation of the normality assumption.
We (I) actually know in this case that the top points are animal words, and the bottom points are plant words:
ggplot(ratings,
aes(x = Frequency, y = meanSizeRating, color=Class)
) +
geom_point(size=3) +
scale_color_manual(values = c("animal" = "dodgerblue",
"plant" = "red")
) +
theme_bw(base_size = 16)
So what do we do about this? We might enrich our model by simply adding class as another predictor. This will create what is known as a multiple linear regression, where we have multiple predictors that combine linearly (additively) to predict our mean size ratings:
good.lm <- lm(meanSizeRating ~ Frequency + Class, data=ratings)
summary(good.lm)
##
## Call:
## lm(formula = meanSizeRating ~ Frequency + Class, data = ratings)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.18828 -0.20840 0.02022 0.21250 0.89824
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.19113 0.19117 16.692 <2e-16 ***
## Frequency 0.09394 0.03561 2.638 0.0101 *
## Classplant -1.68952 0.09061 -18.647 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.3802 on 78 degrees of freedom
## Multiple R-squared: 0.8488, Adjusted R-squared: 0.845
## F-statistic: 219 on 2 and 78 DF, p-value: < 2.2e-16
# anova(good.lm)
We might write the model here as:
\[ y_i = \alpha + \beta_{Frequency} \cdot Freq_i + \beta_{Class} \cdot Class_i + \epsilon_i \]
One thing you might ask is “How do we multiply \(\beta_{Class}\) by the Class label, which is categorical?”
The answer is that we associate each level of that factor with a number (you should know this already from lecture). It’s the contrast associated with the factor:
contrasts(ratings$Class)
## plant
## animal 0
## plant 1
And we see that we have a simple treatment (dummy) coding of this factor for the time being.
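If you want to see exactly how that 0/1 coding enters the regression, you can peek at the design matrix R builds internally when fitting the model (nothing new is being estimated here, this is just for inspection):
head(model.matrix(good.lm))   # an intercept column, the Frequency values, and the 0/1 Classplant dummy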
How do we interpret the coefficients of this model, then?
When we inspect this model, we observe that \(t\)-tests associated with all three parameters are significant. Another way we can look at the status of different predictors is to look at an ANOVA table, which will calculate \(F\)-values to test whether it is useful to include each of our predictors:
anova(good.lm)
## Analysis of Variance Table
##
## Response: meanSizeRating
## Df Sum Sq Mean Sq F value Pr(>F)
## Frequency 1 13.048 13.048 90.283 1.151e-14 ***
## Class 1 50.253 50.253 347.703 < 2.2e-16 ***
## Residuals 78 11.273 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
An important note: this ANOVA table represents a sequential ANOVA (sometimes called Type I sums of squares). That is, the \(F\)-tests reported here test the effect of adding in each predictor given the predictors entered before it: does a model with Frequency fare better than the null model, and then does a model with Class and Frequency fare better than one with just Frequency?
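Another way to see the sequential logic is to spell out the nested model comparisons explicitly and hand them all to anova() at once; a sketch (m0, m1, and m2 are just placeholder names, and the \(F\)-statistics should line up with the sequential table above since both use the full model’s residual mean square):
m0 <- lm(meanSizeRating ~ 1, data = ratings)                  # null (intercept-only) model
m1 <- lm(meanSizeRating ~ Frequency, data = ratings)          # ... + Frequency
m2 <- lm(meanSizeRating ~ Frequency + Class, data = ratings)  # ... + Class
anova(m0, m1, m2)   # each row tests the improvement over the model before it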
If we entered the predictors into the model in a different order, the tests in the sequential ANOVA would be slightly different (unless the predictors are entirely orthogonal):
## Showing swapping the order of predictors here
rev.good.lm <- lm(meanSizeRating ~ Class + Frequency, data=ratings)
anova(rev.good.lm)
## Analysis of Variance Table
##
## Response: meanSizeRating
## Df Sum Sq Mean Sq F value Pr(>F)
## Class 1 62.295 62.295 431.0250 < 2e-16 ***
## Frequency 1 1.006 1.006 6.9606 0.01006 *
## Residuals 78 11.273 0.145
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can still visualize our multiple linear regression just as we did our simple linear regression:
ggplot(ratings,
aes(x = Frequency, y = meanSizeRating, color=Class)
) +
geom_point(size=3) +
geom_smooth(method=lm, se=F, fullrange=T) +
scale_color_manual(values = c("animal" = "dodgerblue",
"plant" = "red")
) +
theme_bw(base_size = 16)
## `geom_smooth()` using formula = 'y ~ x'
Notice that ggplot here automatically returns two
regression lines (one for each group) that have slightly different
slopes. There is a suggestion of an interaction of
class and frequency: the effect of frequency on size ratings is less
pronounced (shallower slope) for plants than it is for animals.
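(If you wanted to chase that interaction down, it’s a one-line change to the model formula; a sketch we won’t interpret today, with int.lm as a placeholder name:)
# Frequency * Class expands to Frequency + Class + Frequency:Class
int.lm <- lm(meanSizeRating ~ Frequency * Class, data = ratings)
summary(int.lm)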
Before worrying about that, something is still off about our fits, though. We are imposing straight lines on data that does not obviously appear to have a linear relationship. We can use a non-parametric smoother, which doesn’t make assumptions about the model that generated the data, to diagnose the nature of the relationship between frequency and size ratings.
The most standard way to do this is something called locally estimated scatterplot smoothing (LOESS): you basically calculate local regressions by sliding a small moving window across the data points.
ggplot(ratings,
aes(x = Frequency, y = meanSizeRating, color=Class)
) +
geom_point(size=3) +
geom_smooth(method=loess) +
scale_color_manual(values = c("animal" = "dodgerblue",
"plant" = "red")
) +
theme_bw(base_size = 16)
## `geom_smooth()` using formula = 'y ~ x'
Indeed, we see that size ratings seem to be a quadratic function of frequency. We can account for this in the model by adding in a quadratic predictor, but this is plenty of stuff for now, so more on this next time…