Why visualize?

“The simple graph has brought more information to the data analyst’s mind than any other device.” — John Tukey

“Visualization is a fundamentally human activity. A good visualization will show you things you did not expect or raise new questions about the data. A good visualization might also hint that you’re asking the wrong question or that you need to collect different data. Visualizations can surprise you, but they don’t scale particularly well because they require a human to interpret them.” – R4DS (R for Data Science)

Datasaurus dozen

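The shape of these data, for instance, is much more apparent in a plot than in a table of individual (x, y) points. The 13 datasets summarized below have nearly identical means, standard deviations, and correlations, yet they look nothing alike when plotted.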

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

## Warning: package 'datasauRus' was built under R version 4.3.3

## # A tibble: 13 × 6
##    dataset    mean_x mean_y std_dev_x std_dev_y corr_x_y
##    <chr>       <dbl>  <dbl>     <dbl>     <dbl>    <dbl>
##  1 away         54.3   47.8      16.8      26.9  -0.0641
##  2 bullseye     54.3   47.8      16.8      26.9  -0.0686
##  3 circle       54.3   47.8      16.8      26.9  -0.0683
##  4 dino         54.3   47.8      16.8      26.9  -0.0645
##  5 dots         54.3   47.8      16.8      26.9  -0.0603
##  6 h_lines      54.3   47.8      16.8      26.9  -0.0617
##  7 high_lines   54.3   47.8      16.8      26.9  -0.0685
##  8 slant_down   54.3   47.8      16.8      26.9  -0.0690
##  9 slant_up     54.3   47.8      16.8      26.9  -0.0686
## 10 star         54.3   47.8      16.8      26.9  -0.0630
## 11 v_lines      54.3   47.8      16.8      26.9  -0.0694
## 12 wide_lines   54.3   47.8      16.8      26.9  -0.0666
## 13 x_shape      54.3   47.8      16.8      26.9  -0.0656
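A summary like the one above can be reproduced with a short dplyr pipeline. This is a sketch, assuming the `datasaurus_dozen` data frame from the datasauRus package (columns `dataset`, `x`, `y`). The slide does not show the code that produced the table, so the exact calls and the name `dozen_summary` are assumptions. Plotting a single dataset then shows what the summary statistics hide.

```r
library(dplyr)
library(ggplot2)
library(datasauRus)

# Per-dataset summary statistics (assumed to correspond to the table above)
dozen_summary <- datasaurus_dozen |>
  group_by(dataset) |>
  summarize(
    mean_x    = mean(x),
    mean_y    = mean(y),
    std_dev_x = sd(x),
    std_dev_y = sd(y),
    corr_x_y  = cor(x, y)
  )
dozen_summary

# Scatterplot of one dataset: its shape is invisible in the summary table
datasaurus_dozen |>
  filter(dataset == "dino") |>
  ggplot(aes(x = x, y = y)) +
  geom_point()
```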

Some data

Here we have some subjective frequency ratings, ratings of estimated weight, and ratings of estimated size, averaged over subjects, for 81 concrete English nouns.

library(languageR)

## Warning: package 'languageR' was built under R version 4.3.3

str(ratings)

## 'data.frame':    81 obs. of  14 variables:
##  $ Word            : Factor w/ 81 levels "almond","ant",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Frequency       : num  4.2 5.35 6.3 3.83 3.66 ...
##  $ FamilySize      : num  0 1.39 1.1 0 0 ...
##  $ SynsetCount     : num  1.1 1.1 1.1 1.39 1.1 ...
##  $ Length          : int  6 3 5 7 9 7 6 6 3 6 ...
##  $ Class           : Factor w/ 2 levels "animal","plant": 2 1 2 2 2 2 1 2 1 1 ...
##  $ FreqSingular    : int  24 69 315 26 19 24 53 74 155 37 ...
##  $ FreqPlural      : int  42 140 231 19 19 6 78 77 103 14 ...
##  $ DerivEntropy    : num  0 0.562 0.496 0 0 ...
##  $ Complex         : Factor w/ 2 levels "complex","simplex": 2 2 2 2 2 2 2 2 2 2 ...
##  $ rInfl           : num  -0.542 -0.7 0.309 0.3 0 ...
##  $ meanWeightRating: num  1.49 3.35 2.19 1.32 1.44 ...
##  $ meanSizeRating  : num  1.89 3.63 2.47 1.76 1.87 ...
##  $ meanFamiliarity : num  3.72 3.6 5.84 4.4 3.68 4.12 2.12 5.68 3.2 2.2 ...

We will make use of the following variables:

Frequency - actual word frequency
meanFamiliarity - subjective frequency rating
Class - whether word is a plant or animal

## 'data.frame':    81 obs. of  4 variables:
##  $ Word           : Factor w/ 81 levels "almond","ant",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ Frequency      : num  4.2 5.35 6.3 3.83 3.66 ...
##  $ meanFamiliarity: num  3.72 3.6 5.84 4.4 3.68 4.12 2.12 5.68 3.2 2.2 ...
##  $ Class          : Factor w/ 2 levels "animal","plant": 2 1 2 2 2 2 1 2 1 1 ...
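A four-column subset like the one printed above could be built as follows. The name `ratings_subset` is hypothetical (the slide does not show the code behind this output), and base R indexing is only one of several ways to select columns.

```r
library(languageR)

# Keep only the four columns used in the rest of this section
ratings_subset <- ratings[, c("Word", "Frequency", "meanFamiliarity", "Class")]

# Inspect the structure of the reduced data frame
str(ratings_subset)
```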

Today’s goal

Create a figure showing the relationship between each word's actual frequency and its subjective frequency rating, taking into account the class the word belongs to.

The basic ggplot

Using your data
define how variables in your dataset are mapped to visual properties (aesthetics)
determine the geometrical object that a plot uses to represent data (geom)

1 `data`

Use ratings data

ggplot(
    data = ratings
 )

Notice that we don’t see anything besides a grey rectangle! We’ve added the data layer, but haven’t specified any of the other layers (aesthetics, geoms, statistics, coords, etc.).

So there’s nothing visible just yet

First the aesthetic mapping…

2 `aesthetic mapping`

Map Frequency to x-axis and meanFamiliarity to y-axis.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 )

Notice that the data is there: how else would the ggplot object know that Frequency and meanFamiliarity are columns in our ratings data frame, or what an appropriate range for the x- and y-axis values should be?

But we’re still not drawing any points – we didn’t specify a geom yet!
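One way to watch the layers accumulate is to store the plot in an object and add to it with `+`. This is a minimal sketch: the object names `p_data`, `p_aes`, and `p_points` are illustrative, and peeking at `$layers` is only a convenient way to inspect the plot object, not something needed in everyday plotting.

```r
library(ggplot2)
library(languageR)

# Step 1: data only (renders as an empty grey panel)
p_data <- ggplot(data = ratings)

# Step 2: data plus aesthetic mapping (axes appear, but no points)
p_aes <- ggplot(
  data = ratings,
  mapping = aes(x = Frequency, y = meanFamiliarity)
)

# Neither plot has a geom layer yet
length(p_data$layers)
length(p_aes$layers)

# Step 3: adding a geom creates the first layer
p_points <- p_aes + geom_point()
length(p_points$layers)
```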

3 `geom`

Represent each value with a point.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point()

There’s some kind of trend here, but it is hard to discern when every point is black and nothing distinguishes one word from another.

Adding to the basics

Mapping categorical variables

When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here color) … a process known as scaling. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity,
        color = Class
    )
 ) +
  geom_point()

Global vs. local `aesthetics`

Aesthetic mappings can be defined:
globally in ggplot(), in which case they are passed down to all geoms
locally in geom_*(), in which case they are used by that geom only

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class)
    )

Mapping vs. setting `aesthetics`

mapping lets us determine a geom’s aesthetics based on a variable, and is passed as an argument inside aes()
setting lets us fix a geom’s aesthetics at a constant value (not based on any variable), and is passed as an argument to geom_*() directly

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    )
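A common slip is to put a constant inside `aes()`. The sketch below contrasts the two cases; the object names `p_mapped` and `p_set` are illustrative. Inside `aes()`, the string "blue" is treated as a one-level variable, so ggplot2 assigns it a default color from the scale and adds a legend. Outside `aes()`, it is taken literally as the color to draw with.

```r
library(ggplot2)
library(languageR)

# Constant inside aes(): "blue" is treated as data, not as a color name
p_mapped <- ggplot(ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(mapping = aes(color = "blue"), size = 3)

# Constant outside aes(): the points are actually drawn in blue
p_set <- ggplot(ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(color = "blue", size = 3)

p_mapped
p_set
```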

Labels

`labels`: title and subtitle

Add title “Subjective frequency ratings” with subtitle “for 81 english nouns”

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns"
  )

`labels`: x and y axis

Label x-axis “Actual frequency” and y-axis “Frequency rating”

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating"
  )

Note that these labels are keyed by the name of the aesthetic (x and y), not by the name of the variable.

`labels`: legend

Label the legend “word class”.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  )

Note that the legend title is keyed by the name of the aesthetic as well (color).

Theme and adjusting scaling

`themes`

Apply classic theme with base_size 20.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)

`scales`: changing color

Remember: When a categorical variable is mapped to an aesthetic, ggplot2 will automatically assign a unique value of the aesthetic (here color) … a process known as scaling. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class), 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) +
  scale_color_brewer(palette = "Paired")
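Besides a ready-made palette, colors can be assigned to each level by hand with `scale_color_manual()`. This is a sketch of that alternative; the two color names are arbitrary choices for illustration, not the ones used in the slides, and `p_manual` is a hypothetical object name.

```r
library(ggplot2)
library(languageR)

p_manual <- ggplot(ratings, aes(x = Frequency, y = meanFamiliarity)) +
  geom_point(mapping = aes(color = Class), size = 3) +
  # Named vector: factor level -> color (arbitrary illustrative choices)
  scale_color_manual(values = c(animal = "darkorange", plant = "forestgreen"))

p_manual
```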

Aesthetics

color

Map the color aesthetic to a variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class),
    size = 3
    ) +
 theme_classic(base_size = 20)

color

Set a constant value for the color aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue",
    size = 3
    ) +
 theme_classic(base_size = 20)

size

Set a constant value for the size aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(color = Class),
    size = 3
    ) +
 theme_classic(base_size = 20)

size

Map the size aesthetic to a variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point(
    mapping = aes(
        color = Class,
        size = Complex
    )
    ) +
 theme_classic(base_size = 20)

## Warning: Using size for a discrete variable is not advised.

shape

Map the shape aesthetic to a different variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Complex),
    size = 3
    ) +
 theme_classic(base_size = 20)

shape

Map the shape aesthetic to the same variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Class),
    size = 3
    ) +
 theme_classic(base_size = 20)

alpha

Set a constant value for the alpha aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Class),
    alpha = 0.5,
    size = 3
    ) +
 theme_classic(base_size = 20)

alpha

Map the alpha aesthetic to a variable

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    mapping = aes(
        color = Class,
        shape = Class,
        alpha = Length),
    size = 3
    ) +
 theme_classic(base_size = 20)

Geometric objects

`geom_*()` aka geoms

There are many. We will start with these, and add a few additional geoms later in the semester:

`geom_histogram()` - histogram, distribution of a continuous variable
`geom_density()` - density curve, distribution of a continuous variable
`geom_bar()` - bar chart, distribution of a categorical variable
`geom_point()` - scatterplot
`geom_smooth()` - smoothed line of best fit

`geom_histogram()`

A histogram divides the x-axis into equally spaced bins and then uses the height of a bar to display the number of observations that fall in each bin. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

`geom_histogram()`

bins - How many bins should we have?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        bins = 10
    )

`geom_histogram()`

binwidth - How wide should the bins be?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        binwidth = 0.25
    )
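`bins` and `binwidth` are two ways of controlling the same binning. As a rough check, `ggplot_build()` exposes the binned data that ggplot2 computed, so the bars can be counted directly. This is a sketch for exploration only; the object names `p_bins` and `p_width` are illustrative.

```r
library(ggplot2)
library(languageR)

p_bins <- ggplot(ratings, aes(x = meanFamiliarity)) +
  geom_histogram(bins = 10)

p_width <- ggplot(ratings, aes(x = meanFamiliarity)) +
  geom_histogram(binwidth = 0.25)

# The computed layer data has one row per bar
binned_bins  <- ggplot_build(p_bins)$data[[1]]
binned_width <- ggplot_build(p_width)$data[[1]]
nrow(binned_bins)
nrow(binned_width)

# Every observation lands in exactly one bin, whichever option is used
sum(binned_bins$count)
sum(binned_width$count)
```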

`geom_histogram()`

color - What should the outline color be?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        binwidth = 0.25,
        color = "red"
    )

`geom_histogram()`

fill - What should the fill color be?

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_histogram(
        binwidth = 0.25,
        color = "red",
        fill = "lightblue"
    )

`geom_density()`

Imagine a histogram made out of wooden blocks. Then, imagine that you drop a cooked spaghetti string over it. The shape the spaghetti will take draped over blocks can be thought of as the shape of the density curve. – R4DS (now we’re doing science… :) )

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity
    )
) + 
    geom_density()

`geom_density()`

Map Class to color aesthetic

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity,
        color = Class
    )
) + 
    geom_density()

`geom_density()`

Set linewidth

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity,
        color = Class
    )
) + 
    geom_density(linewidth = 2)

`geom_density()`

Map Class to fill and set alpha

ggplot(
    data = ratings,
    mapping = aes(
        x = meanFamiliarity,
        fill = Class
    )
) + 
    geom_density(alpha = 0.5)

`geom_bar()`

To examine the distribution of a categorical variable, you can use a bar chart. The height of the bars displays how many observations occurred with each x value. – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Class
    )
) + 
    geom_bar()

`geom_bar()` - stacked

We can use stacked bar plots to visualize the relationship between two categorical variables

ggplot(
    data = ratings,
    mapping = aes(
        x = Class,
        fill = Complex
    )
) + 
    geom_bar()

`geom_bar()` - relative frequency

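We can use relative frequencies to visualize the relationship between two categorical variables. With position = "fill", every bar is rescaled to a height of 1, so the segments show proportions rather than counts.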

ggplot(
    data = ratings,
    mapping = aes(
        x = Class,
        fill = Complex
    )
) + 
    geom_bar(position = "fill")
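The numbers behind the stacked and filled bar charts can be checked with a plain cross-tabulation. A minimal sketch in base R; the names `counts` and `props` are illustrative.

```r
library(languageR)

# Counts of Complex within each Class (the segment heights of the stacked bars)
counts <- table(ratings$Class, ratings$Complex)
counts

# Row-wise proportions (the segment heights with position = "fill")
props <- prop.table(counts, margin = 1)
props
```

Each row of `props` sums to 1, which is why both bars reach the same height in the relative frequency plot.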

`geom_bar()` - dodged

We can use a dodged bar plot to visualize the relationship between two categorical variables side-by-side, not stacked

ggplot(
    data = ratings,
    mapping = aes(
        x = Class,
        fill = Complex
    )
) + 
    geom_bar(position = "dodge")

`geom_point()`

Scatterplots are useful for displaying the relationship between two numerical variables – R4DS

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue", 
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)

`geom_point()` with `geom_smooth()`

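geom_smooth() draws a best-fitting curve through the points. With no method specified, it uses a loess smoother for a dataset of this size, as the message below the code reports.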

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue", 
    size = 3
    ) +
  geom_smooth() +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)

## `geom_smooth()` using method = 'loess' and formula = 'y ~ x'

`geom_point()` with `geom_smooth(method="lm")`

method = "lm" draws the best-fitting straight line from a linear model instead.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    color = "blue", 
    size = 3
    ) +
  geom_smooth(method="lm") +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)

## `geom_smooth()` using formula = 'y ~ x'
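The line drawn by `geom_smooth(method = "lm")` with formula `y ~ x` is an ordinary least-squares fit, so the same model can be fit directly with base R's `lm()` to read off its intercept and slope. A minimal sketch; the object name `fit` is illustrative.

```r
library(languageR)

# Same model as geom_smooth(method = "lm") with y = meanFamiliarity, x = Frequency
fit <- lm(meanFamiliarity ~ Frequency, data = ratings)

# Intercept and slope of the fitted line
coef(fit)
```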

`geom_point()` with `geom_smooth(method="lm")`

We can also map Class to color globally. Both geoms then inherit the mapping, so we get a separate fitted line for each class.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity,
        color = Class
    )
 ) +
  geom_point( 
    size = 3
    ) +
  geom_smooth(method="lm") +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)

## `geom_smooth()` using formula = 'y ~ x'

`geom_point()` with `geom_smooth(method="lm")`

Or keep a single smooth for all words by mapping color locally, in the point geom only

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency, 
        y = meanFamiliarity
    )
 ) +
  geom_point( 
    aes(color = Class),
    size = 3
    ) +
  geom_smooth(method="lm") +
  labs(
    title = "Subjective frequency ratings", 
    subtitle = "for 81 english nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20)

## `geom_smooth()` using formula = 'y ~ x'

P4 (Visualization & ggplot2)

Spencer Caplan (adapted from material via Katie Schuler)

2025-10-09

We’ll resume with Facets next time…!
