“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey
“Visualization is a fundamentally human activity. A good visualization will show you things you did not expect or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question or that you need to collect different data. Visualizations can surprise you, but they don’t scale particularly well because they require a human to interpret them.” – R4DS (R for Data Science)
The structure of these data, for instance, will be much more apparent visually than if you were to look at the individual x, y points in a table…
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## Warning: package 'datasauRus' was built under R version 4.3.3
## # A tibble: 13 × 6
## dataset mean_x mean_y std_dev_x std_dev_y corr_x_y
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 away 54.3 47.8 16.8 26.9 -0.0641
## 2 bullseye 54.3 47.8 16.8 26.9 -0.0686
## 3 circle 54.3 47.8 16.8 26.9 -0.0683
## 4 dino 54.3 47.8 16.8 26.9 -0.0645
## 5 dots 54.3 47.8 16.8 26.9 -0.0603
## 6 h_lines 54.3 47.8 16.8 26.9 -0.0617
## 7 high_lines 54.3 47.8 16.8 26.9 -0.0685
## 8 slant_down 54.3 47.8 16.8 26.9 -0.0690
## 9 slant_up 54.3 47.8 16.8 26.9 -0.0686
## 10 star 54.3 47.8 16.8 26.9 -0.0630
## 11 v_lines 54.3 47.8 16.8 26.9 -0.0694
## 12 wide_lines 54.3 47.8 16.8 26.9 -0.0666
## 13 x_shape 54.3 47.8 16.8 26.9 -0.0656
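The summary table above can be reproduced, and then contrasted with a plot, using a sketch like the following. It assumes the table was computed from the `datasaurus_dozen` data frame in the datasauRus package (suggested by the package warning and the dataset names); the object names `summary_stats` and `dozen_plot` are illustrative.

```r
library(dplyr)
library(ggplot2)
library(datasauRus)

# Per-dataset summary statistics: nearly identical across all 13 datasets
summary_stats <- datasaurus_dozen %>%
  group_by(dataset) %>%
  summarize(
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )
summary_stats

# The same data plotted: each dataset has a visibly different shape
dozen_plot <- ggplot(
  data = datasaurus_dozen,
  mapping = aes(x = x, y = y)
) +
  geom_point(size = 0.5) +
  facet_wrap(~dataset, ncol = 5)
dozen_plot
```

The numbers agree to several digits, while the panels look nothing alike, which is the case for plotting before summarizing.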
Here we have some subjective frequency ratings, ratings of estimated weight, and ratings of estimated size, averaged over subjects, for 81 concrete English nouns.
library(languageR)
## Warning: package 'languageR' was built under R version 4.3.3
str(ratings)
## 'data.frame': 81 obs. of 14 variables:
## $ Word : Factor w/ 81 levels "almond","ant",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Frequency : num 4.2 5.35 6.3 3.83 3.66 ...
## $ FamilySize : num 0 1.39 1.1 0 0 ...
## $ SynsetCount : num 1.1 1.1 1.1 1.39 1.1 ...
## $ Length : int 6 3 5 7 9 7 6 6 3 6 ...
## $ Class : Factor w/ 2 levels "animal","plant": 2 1 2 2 2 2 1 2 1 1 ...
## $ FreqSingular : int 24 69 315 26 19 24 53 74 155 37 ...
## $ FreqPlural : int 42 140 231 19 19 6 78 77 103 14 ...
## $ DerivEntropy : num 0 0.562 0.496 0 0 ...
## $ Complex : Factor w/ 2 levels "complex","simplex": 2 2 2 2 2 2 2 2 2 2 ...
## $ rInfl : num -0.542 -0.7 0.309 0.3 0 ...
## $ meanWeightRating: num 1.49 3.35 2.19 1.32 1.44 ...
## $ meanSizeRating : num 1.89 3.63 2.47 1.76 1.87 ...
## $ meanFamiliarity : num 3.72 3.6 5.84 4.4 3.68 4.12 2.12 5.68 3.2 2.2 ...
We will make use of the following variables:
- Frequency - actual word frequency
- meanFamiliarity - subjective frequency rating
- Class - whether the word is a plant or an animal
## 'data.frame': 81 obs. of 4 variables:
## $ Word : Factor w/ 81 levels "almond","ant",..: 1 2 3 4 5 6 7 8 9 10 ...
## $ Frequency : num 4.2 5.35 6.3 3.83 3.66 ...
## $ meanFamiliarity: num 3.72 3.6 5.84 4.4 3.68 4.12 2.12 5.68 3.2 2.2 ...
## $ Class : Factor w/ 2 levels "animal","plant": 2 1 2 2 2 2 1 2 1 1 ...
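The code that produced the four-variable structure above is not shown. One plausible way to get it is `dplyr::select()`, sketched here; `ratings_small` is a hypothetical name, chosen so that the full `ratings` data frame stays available for later plots.

```r
library(dplyr)
library(languageR)

# Keep only the variables used in the first figure
ratings_small <- ratings %>%
  select(Word, Frequency, meanFamiliarity, Class)

str(ratings_small)
```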
Create this figure showing the relationship between the actual frequency and the subjective frequency rating of each word, taking into account the class the word belongs to.
We will build it in three steps: data, aesthetics, and geom.
data
Use the ratings data.
ggplot(
data = ratings
)
Notice that we don’t see anything besides a grey rectangle! We’ve added the data layer, but haven’t specified any of the other layers (aesthetics, geoms, statistics, coords, etc.), so there’s nothing visible just yet.
First the aesthetic mapping…
aesthetic mapping
Map Frequency to the x-axis and meanFamiliarity to the y-axis.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
)
Notice that the data is there: how else would the ggplot object know that Frequency and meanFamiliarity are columns in our ratings data frame, and what an appropriate range for the x- and y-axis values should be?
But we’re still not drawing any points, because we haven’t specified a geom yet!
geom
Represent each value with a point.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point()
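To check that the data and mappings really are stored before anything is drawn, the plot object can be inspected directly. This is a minimal sketch that relies on the `$mapping` and `$layers` components of a ggplot object as they exist in ggplot2 3.5.1 (the version loaded above); `p_empty` and `p_points` are illustrative names.

```r
library(ggplot2)
library(languageR)

# Data and aesthetic mapping only: an empty panel with axes
p_empty <- ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity
  )
)
names(p_empty$mapping)     # the aesthetics stored on the plot
length(p_empty$layers)     # 0 layers, so nothing is drawn

# Adding a geom adds a layer
p_points <- p_empty + geom_point()
length(p_points$layers)    # 1 layer
```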
There’s some kind of trend here, but it is not very discernible when all the points are black and treated as a single undifferentiated group.
When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here color) … a process known as scaling. – R4DS
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity,
color = Class
)
) +
geom_point()
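Scaling can be made visible by looking at the data ggplot2 actually draws. The sketch below uses `layer_data()`, which returns a layer's data after scales have been applied; `p_class` and `point_data` are illustrative names.

```r
library(ggplot2)
library(languageR)

p_class <- ggplot(
  data = ratings,
  mapping = aes(
    x = Frequency,
    y = meanFamiliarity,
    color = Class
  )
) +
  geom_point()

# Data as drawn: Class has been replaced by a colour value per point
point_data <- layer_data(p_class)
unique(point_data$colour)   # one colour per level of Class
```

Note that the computed column is named `colour`, the spelling ggplot2 uses internally.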
aesthetics
Aesthetics can be specified in ggplot(), in which case they are passed down to all geoms, or in geom_*(), in which case they are used by that geom only.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class)
)
aesthetics
Aesthetics can be mapped to a variable inside aes(), or set to a constant value in geom_*() directly.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
)
labels: title and subtitle
Add the title “Subjective frequency ratings” with the subtitle “for 81 english nouns”.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns"
)
labels: x and y axis
Label the x-axis “Actual frequency” and the y-axis “Frequency rating”.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating"
)
Note that these labels are keyed to the aesthetics they describe (x and y).
labels: legend
Label the legend “word class”.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating",
color = "word class"
)
Note that this label is keyed to an aesthetic as well (color).
themes
Apply classic theme with base_size 20.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating",
color = "word class"
) +
theme_classic(base_size = 20)
scales: changing color
Remember: When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here color) … a process known as scaling. – R4DS
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating",
color = "word class"
) +
theme_classic(base_size = 20) +
scale_color_brewer(palette = "Paired")
Map the color aesthetic to a variable
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
theme_classic(base_size = 20)
Set a constant value for the color aesthetic
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
color = "blue",
size = 3
) +
theme_classic(base_size = 20)
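A common slip is to put a constant inside `aes()`. ggplot2 then treats `"blue"` as a one-level categorical variable and scales it like any other, so the points do not come out blue. A minimal sketch of the difference (`p_mapped` and `p_set` are illustrative names):

```r
library(ggplot2)
library(languageR)

base_plot <- ggplot(
  data = ratings,
  mapping = aes(x = Frequency, y = meanFamiliarity)
)

# Constant inside aes(): treated as data and run through the color scale
p_mapped <- base_plot + geom_point(mapping = aes(color = "blue"), size = 3)

# Constant outside aes(): used as-is
p_set <- base_plot + geom_point(color = "blue", size = 3)

unique(layer_data(p_mapped)$colour)   # a default palette colour, not "blue"
unique(layer_data(p_set)$colour)      # "blue"
```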
Set a constant value for the size aesthetic
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
theme_classic(base_size = 20)
Map the size aesthetic to a variable
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(
color = Class,
size = Complex
)
) +
theme_classic(base_size = 20)
## Warning: Using size for a discrete variable is not advised.
Map the shape aesthetic to a different variable
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(
color = Class,
shape = Complex),
size = 3
) +
theme_classic(base_size = 20)
Map the shape aesthetic to the same variable
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(
color = Class,
shape = Class),
size = 3
) +
theme_classic(base_size = 20)
Set a constant value for the alpha aesthetic
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(
color = Class,
shape = Class),
alpha = 0.5,
size = 3
) +
theme_classic(base_size = 20)
Map the alpha aesthetic to a variable
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(
color = Class,
shape = Class,
alpha = Length),
size = 3
) +
theme_classic(base_size = 20)
geom_*() aka geoms
There are many. We will start with these, and add a few additional geoms later in the semester:
geom_histogram() | histogram, distribution of a continuous variable
geom_density() | distribution of a continuous variable
geom_bar() | distribution of a categorical variable
geom_point() | scatterplot
geom_smooth() | smoothed line of best fit
geom_histogram()
A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. – R4DS
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity
)
) +
geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
geom_histogram()
bins - How many bins should we have?
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity
)
) +
geom_histogram(
bins = 10
)
geom_histogram()
binwidth - How wide should the bins be?
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity
)
) +
geom_histogram(
binwidth = 0.25
)
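`bins` and `binwidth` are two ways of controlling the same thing: the number of bins is roughly the range of the data divided by the bin width. The sketch below checks this for `binwidth = 0.25` by reading the computed bins back out with `layer_data()`; the object names are illustrative.

```r
library(ggplot2)
library(languageR)

# Rough number of bins implied by binwidth = 0.25
x_range <- range(ratings$meanFamiliarity)
diff(x_range) / 0.25

p_hist <- ggplot(
  data = ratings,
  mapping = aes(x = meanFamiliarity)
) +
  geom_histogram(binwidth = 0.25)

# One row per bin, with its limits and its count
bin_data <- layer_data(p_hist)
nrow(bin_data)                      # number of bins actually drawn
range(bin_data$xmax - bin_data$xmin) # every bin is 0.25 wide
sum(bin_data$count)                 # all 81 words are counted
```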
geom_histogram()
color - What should the outline color be?
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity
)
) +
geom_histogram(
binwidth = 0.25,
color = "red"
)
geom_histogram()
fill - What should the fill color be?
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity
)
) +
geom_histogram(
binwidth = 0.25,
color = "red",
fill = "lightblue"
)
geom_density()
Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. – R4DS (now we’re doing science… :) )
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity
)
) +
geom_density()
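Unlike a histogram of counts, a density curve is scaled so that the area under it is 1. A quick numerical check, assuming `geom_density()` uses the same defaults as base R's `density()` (which it does unless the bandwidth or kernel is changed):

```r
library(languageR)

# Kernel density estimate on an equally spaced grid
dens <- density(ratings$meanFamiliarity)

# Approximate the area under the curve with a Riemann sum
step <- diff(dens$x[1:2])
area <- sum(dens$y) * step
area   # close to 1
```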
geom_density()
Map Class to color aesthetic
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity,
color = Class
)
) +
geom_density()
geom_density()
Set linewidth
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity,
color = Class
)
) +
geom_density(linewidth = 2)
geom_density()
Map Class to fill and set alpha
ggplot(
data = ratings,
mapping = aes(
x = meanFamiliarity,
fill = Class
)
) +
geom_density(alpha = 0.5)
geom_bar()
To examine the distribution of a categorical variable, you can use a bar chart. The height of the bars displays how many observations occurred with each x value. – R4DS
ggplot(
data = ratings,
mapping = aes(
x = Class
)
) +
geom_bar()
geom_bar() - stacked
We can use stacked bar plots to visualize the relationship between two categorical variables.
ggplot(
data = ratings,
mapping = aes(
x = Class,
fill = Complex
)
) +
geom_bar()
geom_bar() - relative frequency
We can use relative frequency to visualize the relationship between two categorical variables (as a percentage).
ggplot(
data = ratings,
mapping = aes(
x = Class,
fill = Complex
)
) +
geom_bar(position = "fill")
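The proportions that `position = "fill"` displays can be computed by hand, which helps when exact values are needed rather than bar heights. A minimal dplyr sketch (`class_props` is an illustrative name):

```r
library(dplyr)
library(languageR)

# Share of complex vs. simplex words within each Class
class_props <- ratings %>%
  count(Class, Complex) %>%
  group_by(Class) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

class_props
```

Within each Class the `prop` values sum to 1, which is why every bar in the filled plot has the same height.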
geom_bar() - dodged
We can use a dodged bar plot to visualize the relationship between two categorical variables side by side rather than stacked.
ggplot(
data = ratings,
mapping = aes(
x = Class,
fill = Complex
)
) +
geom_bar(position = "dodge")
geom_point()
Scatterplots are useful for displaying the relationship between two numerical variables – R4DS
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
color = "blue",
size = 3
) +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating"
) +
theme_classic(base_size = 20)
geom_point() with geom_smooth()
Adding geom_smooth() draws a best-fitting curve.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
color = "blue",
size = 3
) +
geom_smooth() +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating"
) +
theme_classic(base_size = 20)
## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'
geom_point() with geom_smooth(method="lm")
Adding geom_smooth(method="lm") draws the best-fitting linear model.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
color = "blue",
size = 3
) +
geom_smooth(method="lm") +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating"
) +
theme_classic(base_size = 20)
## `geom_smooth()` using formula = 'y ~ x'
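The line drawn by `geom_smooth(method = "lm")` is the ordinary least-squares fit of y on x, so the same line can be obtained from `lm()` directly. The sketch below fits the model and checks that its predictions match the values ggplot2 computes for the smooth layer; `fit`, `p_lm`, and `smooth_data` are illustrative names.

```r
library(ggplot2)
library(languageR)

# The model behind the line: meanFamiliarity as a linear function of Frequency
fit <- lm(meanFamiliarity ~ Frequency, data = ratings)
coef(fit)   # intercept and slope

# The values ggplot2 computes for the smooth layer
p_lm <- ggplot(
  data = ratings,
  mapping = aes(x = Frequency, y = meanFamiliarity)
) +
  geom_smooth(method = "lm")
smooth_data <- layer_data(p_lm)

# Predictions from lm() at the same x positions
lm_pred <- predict(fit, newdata = data.frame(Frequency = smooth_data$x))
all.equal(as.numeric(smooth_data$y), as.numeric(lm_pred))
```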
geom_point() with geom_smooth(method="lm")
We can also map Class to color by specifying it globally, which gives one line per class.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity,
color = Class
)
) +
geom_point(
size = 3
) +
geom_smooth(method="lm") +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating",
color = "word class"
) +
theme_classic(base_size = 20)
## `geom_smooth()` using formula = 'y ~ x'
geom_point() with geom_smooth(method="lm")
Or include only a single smooth, by specifying color in the point geom only.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
aes(color = Class),
size = 3
) +
geom_smooth(method="lm") +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 english nouns",
x = "Actual frequency",
y = "Frequency rating",
color = "word class"
) +
theme_classic(base_size = 20)
## `geom_smooth()` using formula = 'y ~ x'
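The difference between the last two plots comes down to grouping: a global color mapping is inherited by `geom_smooth()`, which then fits one line per class, while a color mapping local to `geom_point()` leaves the smooth with a single group. A minimal sketch that counts the groups in the smooth layer (layer 2) of each plot; `p_global` and `p_local` are illustrative names.

```r
library(ggplot2)
library(languageR)

# Color mapped globally: inherited by both geoms
p_global <- ggplot(
  data = ratings,
  mapping = aes(x = Frequency, y = meanFamiliarity, color = Class)
) +
  geom_point(size = 3) +
  geom_smooth(method = "lm")

# Color mapped in geom_point() only
p_local <- ggplot(
  data = ratings,
  mapping = aes(x = Frequency, y = meanFamiliarity)
) +
  geom_point(mapping = aes(color = Class), size = 3) +
  geom_smooth(method = "lm")

# Number of separate fits in the smooth layer
n_global <- length(unique(layer_data(p_global, i = 2)$group))   # 2
n_local  <- length(unique(layer_data(p_local, i = 2)$group))    # 1
c(global = n_global, local = n_local)
```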