Picking up from last time

We ended last time just before covering how facets work, so let's do that now.

Facets

smaller plots that display different subsets of data

facet_grid()

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_grid(Class ~ Complex) +
  theme_classic(base_size = 20)

. . .

Compare with the same data, viewed with two aesthetics (color and shape)

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point(
    aes(color = Class, shape = Complex)
  ) +
  theme_classic(base_size = 20)

facet_grid() - just columns

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_grid(. ~ Complex) +
  theme_classic(base_size = 20)

facet_grid() - just columns

and note we can still map other aesthetics!

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point(
    aes(color = Class),
    shape = "triangle"
  ) +
  facet_grid(. ~ Complex) +
  theme_classic(base_size = 20)

facet_grid() - just rows

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_grid(Class ~ .) +
  theme_classic(base_size = 20)

facet_wrap()

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_wrap(~ Class) +
  theme_classic(base_size = 20)

facet_wrap() - number of columns

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point() +
  facet_wrap(~ Class, ncol = 1) +
  theme_classic(base_size = 20)

Helper functions

Remember our goal plot from last week?

Now we have the pieces to make it ourselves.

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
 ) +
  geom_point(
    mapping = aes(color = Class),
    size = 3
    ) +
  labs(
    title = "Subjective frequency ratings",
    subtitle = "for 81 English nouns",
    x = "Actual frequency",
    y = "Frequency rating",
    color = "word class"
  ) +
  theme_classic(base_size = 20) +
  scale_color_brewer(palette = "Paired")

last_plot()

returns the last plot

last_plot()

ggsave()

saves the last plot

ggsave("plot.png", width=5, height=5)

Themes

ggplot2 comes with many complete themes

Default theme

last_plot() + theme_gray(base_size=20)

Sample themes

last_plot() + theme_bw(base_size=20)

last_plot() + theme_classic(base_size=20)

last_plot() + theme_minimal(base_size=20)

last_plot() + theme_void(base_size=20)

Custom themes

I (Spencer) have found that I keep returning to the same sorts of visualizations over and over across different research projects. Rather than repeat code, it becomes useful to define my own theme and apply it (or variants of it) whenever I need it. You’ll see an example of that on HW2.
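As a rough sketch of what that can look like (the name theme_spencer and the specific tweaks here are made up for illustration, not the theme used on HW2):

library(ggplot2)

# a hypothetical reusable theme: start from a complete theme, then tweak it
theme_spencer <- function(base_size = 20) {
  theme_classic(base_size = base_size) +
    theme(
      legend.position = "bottom",               # put the legend under the plot
      plot.title = element_text(face = "bold")  # bold plot titles
    )
}

# apply it like any other complete theme
last_plot() + theme_spencer()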

Shortcuts

ggplot2 calls

Explicit argument names:

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
) +
   geom_point()

Implied argument names:

ggplot(
    ratings,
    aes(
        x = Frequency,
        y = meanFamiliarity
    )
) +
   geom_point()

the pipe %>%

the pipe takes the thing on its left and passes it along to the function on its right

. . .

x %>% f(y) is equivalent to f(x, y)

. . .

library(magrittr)
## 
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
## 
##     set_names
## The following object is masked from 'package:tidyr':
## 
##     extract
x <- c(1.0, 2.245, 3, 4.22222)
x
## [1] 1.00000 2.24500 3.00000 4.22222
# pass x as an argument to function usual way
round(x, digits = 2)
## [1] 1.00 2.24 3.00 4.22

. . .

# pass x as an argument to function with pipe
x %>% round(digits = 2)
## [1] 1.00 2.24 3.00 4.22

There are two ways to write the pipe: %>% (from magrittr) or |> (the native pipe, built into R 4.1 and later)
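A quick sketch of the two spellings side by side (the native |> pipe requires R 4.1 or newer; both calls repeat the rounding example above):

library(magrittr)

x <- c(1.0, 2.245, 3, 4.22222)

x %>% round(digits = 2)  # magrittr pipe
x |> round(digits = 2)   # base R (native) pipe
# both give: 1.00 2.24 3.00 4.22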

the pipe %>% and ggplot

Implied argument names:

ggplot(
    data = ratings,
    mapping = aes(
        x = Frequency,
        y = meanFamiliarity
    )
) +
   geom_point()

Implied argument names + pipe:

ratings %>%
ggplot(
    aes(
        x = Frequency,
        y = meanFamiliarity
    )
) +
   geom_point()

Note that we use the pipe %>% to pass data and arguments into functions, but we ADD (+) layers to a ggplot. Mixing them up is a common mistake! (See the sketch below.)
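A small sketch of the difference (assuming ratings is loaded as in the slides above; the second version is the mistake and is left commented out):

library(tidyverse)

# correct: pipe the data in, then ADD layers with +
ratings %>%
  ggplot(aes(x = Frequency, y = meanFamiliarity)) +
  geom_point()

# mistake: piping ggplot() into geom_point() errors,
# because layers are combined with +, not %>%
# ratings %>%
#   ggplot(aes(x = Frequency, y = meanFamiliarity)) %>%
#   geom_point()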

Now that that’s in place, let’s turn to getting some data ready: the tidyverse

The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.

Tidyverse package docs

Tidyverse

  • ggplot2 - for data visualization
  • dplyr - for data wrangling
  • readr - for reading data
  • tibble - for modern data frames
  • stringr - for string manipulation
  • forcats - for dealing with factors
  • tidyr - for data tidying
  • purrr - for functional programming
Tidyverse hex logos from www.tidyverse.org

Loading the tidyverse

You just need to run / include “library(tidyverse)” at the top of your script

Tidy data

The tidyverse makes use of tidy data, a standard way of structuring datasets (a small example follows the list):

  1. each variable forms a column; each column forms a variable
  2. each observation forms a row; each row forms an observation
  3. each value forms a cell; each cell is a single value
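As a tiny illustration (the values below are made up for the example), a tidy table has one row per observation, one column per variable, and one value per cell:

library(tibble)

# each row = one observation (a word); each column = one variable
tibble(
    word        = c("doe", "stress", "pork"),
    frequency   = c(3.9, 6.5, 5.0),
    familiarity = c(2.4, 5.6, 3.9)
)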

Tidy data

Visual of tidy data rules, from R for Data Science

Why tidy data?

  • Because consistency and uniformity are very helpful when programming
  • Variables as columns works well for vectorized languages (R!)

purrr

Functional programming

to illustrate the joy of the tidyverse and tidy data

purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read.

purrr docs

The map_*() functions

  1. Take a vector as input
  2. Apply a function to each element
  3. Return a new vector

The map_*() functions

We say “functions” because there are 5, one for each type of vector:

  • map() - list
  • map_lgl() - logical
  • map_int() - integer
  • map_dbl() - double
  • map_chr() - character

map use case

df <- data.frame(
    x = 1:10,
    y = 11:20,
    z = 21:30
)

with copy+paste

mean(df$x)
## [1] 5.5
mean(df$y)
## [1] 15.5
mean(df$z)
## [1] 25.5

with map

map(df, mean)
## $x
## [1] 5.5
## 
## $y
## [1] 15.5
## 
## $z
## [1] 25.5
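Since mean() always returns a double, we could also reach for the typed variant; a small sketch (same df as above):

library(purrr)

df <- data.frame(x = 1:10, y = 11:20, z = 21:30)

# typed variant: returns a named double vector instead of a list
map_dbl(df, mean)
# x = 5.5, y = 15.5, z = 25.5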

tibble

modern data frames

A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less and complain more

tibble docs

Tibbles do less than data frames, in a good way:

The take-away is that data.frame and tibble sometimes behave differently. The behavior of tibble makes more sense for modern data science, so we should use it instead!

Create a tibble

Coerce an existing object:

df <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d")
)
as_tibble(df)
## # A tibble: 4 × 2
##       x y    
##   <int> <chr>
## 1     1 a    
## 2     2 b    
## 3     3 c    
## 4     4 d

Pass a column of vectors:

tibble(
    x = 1:4,
    y = c("a", "b", "c", "d")
)
## # A tibble: 4 × 2
##       x y    
##   <int> <chr>
## 1     1 a    
## 2     2 b    
## 3     3 c    
## 4     4 d

Define row-by-row with tribble():

tribble(
    ~x, ~y,
    "a", 1,
    "b", 2,
    "c", 3,
    "d", 4 
)
## # A tibble: 4 × 2
##   x         y
##   <chr> <dbl>
## 1 a         1
## 2 b         2
## 3 c         3
## 4 d         4

Test if tibble

With is_tibble(x) and is.data.frame(x)

Data frame:

df <- data.frame(
    x = 1:4,
    y = c("a", "b", "c", "d")
)
is_tibble(df)
## [1] FALSE
is.data.frame(df)
## [1] TRUE

Tibble:

tib <- tribble(
    ~x, ~y,
    "a", 1,
    "b", 2,
    "c", 3,
    "d", 4 
)
is_tibble(tib)
## [1] TRUE
is.data.frame(tib)
## [1] TRUE

data.frame v tibble

You will encounter 2 main differences (a small demo of the second one follows this list):

  1. printing
    • by default, tibbles print the first 10 rows and all columns that fit on screen, making it easier to work with large datasets.
    • they also report the type of each column (e.g. <dbl>, <chr>)
  2. subsetting - tibbles are more strict than data frames, which fixes two quirks we encountered when subsetting with [[ and $:
    • tibbles never do partial matching
    • they always generate a warning if the column you are trying to extract does not exist.
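A small demo of the second point (the column name val here is deliberately a partial match for value):

library(tibble)

df  <- data.frame(value = 1:3)
tib <- tibble(value = 1:3)

df$val   # data.frame partially matches the name and returns the value column
tib$val  # tibble warns about the unknown column `val` and returns NULL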

readr

The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results.

readr docs

read_*()

The read_*() functions have two important arguments:

  1. file - the path to the file
  2. col_types - a list of how each column should be converted to a specific data type

7 supported file types, read_*()

  • read_csv(): comma-separated values (CSV)
  • read_tsv(): tab-separated values (TSV)
  • read_csv2(): semicolon-separated values
  • read_delim(): delimited files (CSV and TSV are important special cases)
  • read_fwf(): fixed-width files
  • read_table(): whitespace-separated files
  • read_log(): web log files

Read csv files

Path only, readr guesses types:

read_csv(file = "https://pos.it/r4ds-students-csv")

. . .

Path and specify col_types:

read_csv(
    file = "https://pos.it/r4ds-students-csv",
    col_types = list(x = col_character(), y = col_skip())  # x and y stand in for columns in your file
)

Guessing heuristic: try logical, then double, then date-time; otherwise fall back to character

col_types column specification

There are 11 column types that can be specified (a short example follows the list):

  • col_logical() - reads as boolean TRUE FALSE values
  • col_integer() - reads as integer
  • col_double() - reads as double
  • col_number() - numeric parser that can ignore non-numbers
  • col_character() - reads as strings
  • col_factor(levels, ordered = FALSE) - creates factors
  • col_datetime(format = "") - creates date-times
  • col_date(format = "") - creates dates
  • col_time(format = "") - creates times
  • col_skip() - skips a column
  • col_guess() - tries to guess the column type
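For example, a sketch of reading the students file (used below) with mealPlan parsed as a factor; the levels are taken from the printed data further down, so treat this as illustrative:

read_csv(
    file = "https://pos.it/r4ds-students-csv",
    col_types = list(
        mealPlan = col_factor(levels = c("Lunch only", "Breakfast and lunch"))
    )
)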

Reading more complex files

Reading more complex file types requires functions outside the tidyverse:

  • excel with readxl - see Spreadsheets in R for Data Science
  • google sheets with googlesheets4 - see Spreadsheets in R for Data Science
  • databases with DBI - see Databases in R for Data Science
  • json data with jsonlite - see Hierarchical data in R for Data Science

Writing to a file

Write to a .csv file with

write_csv(students, "students.csv")

arguments: the tibble to write, then the name to give the file

Common problems readr

Data set containing 3 common problems

students <- read_csv('https://pos.it/r4ds-students-csv')
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Full Name, favourite.food, mealPlan, AGE
## dbl (1): Student ID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
## # A tibble: 6 × 5
##   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
##          <dbl> <chr>            <chr>              <chr>               <chr>
## 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
## 2            2 Barclay Lynn     French fries       Lunch only          5    
## 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
## 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
## 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
## 6            6 Güvenç Attila    Ice cream          Lunch only          6
  1. Column contains unexpected values (AGE)
  2. Missing values are not NA (AGE and favourite.food)
  3. Column names have spaces (Student ID and Full Name)

Column contains unexpected values

Your dataset has a column that you expected to be logical or double, but there is a typo somewhere, so R has coerced the column into character.

students
## # A tibble: 6 × 5
##   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
##          <dbl> <chr>            <chr>              <chr>               <chr>
## 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
## 2            2 Barclay Lynn     French fries       Lunch only          5    
## 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
## 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
## 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
## 6            6 Güvenç Attila    Ice cream          Lunch only          6

. . .

Solve by specifying the column type col_double() and then using the problems() function to see where R failed.

students_coerced <- read_csv(
    file = 'https://pos.it/r4ds-students-csv', 
    col_types = list(AGE = col_double()))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
##   dat <- vroom(...)
##   problems(dat)
problems(students_coerced)
## # A tibble: 1 × 5
##     row   col expected actual file 
##   <int> <int> <chr>    <chr>  <chr>
## 1     6     5 a double five   ""

Missing values are not NA

Your dataset has missing values, but they were not coded as NA as R expects.

students
## # A tibble: 6 × 5
##   `Student ID` `Full Name`      favourite.food     mealPlan            AGE  
##          <dbl> <chr>            <chr>              <chr>               <chr>
## 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
## 2            2 Barclay Lynn     French fries       Lunch only          5    
## 3            3 Jayendra Lyne    N/A                Breakfast and lunch 7    
## 4            4 Leon Rossini     Anchovies          Lunch only          <NA> 
## 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
## 6            6 Güvenç Attila    Ice cream          Lunch only          6

. . .

Solve by adding an na argument (e.g. na = c("N/A")). Note that this replaces readr’s default na = c("", "NA"), so include "" as well if empty cells should still become NA.

(students_nas <- read_csv(
    file = 'https://pos.it/r4ds-students-csv', 
    na = c("N/A", "<NA>")))
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Full Name, favourite.food, mealPlan, AGE
## dbl (1): Student ID
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 5
##   `Student ID` `Full Name`      favourite.food     mealPlan            AGE   
##          <dbl> <chr>            <chr>              <chr>               <chr> 
## 1            1 Sunil Huffmann   Strawberry yoghurt Lunch only          "4"   
## 2            2 Barclay Lynn     French fries       Lunch only          "5"   
## 3            3 Jayendra Lyne    <NA>               Breakfast and lunch "7"   
## 4            4 Leon Rossini     Anchovies          Lunch only          ""    
## 5            5 Chidiegwu Dunkel Pizza              Breakfast and lunch "five"
## 6            6 Güvenç Attila    Ice cream          Lunch only          "6"

Column names have spaces

Your dataset has column names that include spaces, breaking R’s naming rules. In these cases, you have to refer to them with backticks (e.g. `brain size`).

. . .

We can use the rename() function to fix them.

students %>% 
    rename(
        student_id = `Student ID`,
        full_name = `Full Name`
    )
## # A tibble: 6 × 5
##   student_id full_name        favourite.food     mealPlan            AGE  
##        <dbl> <chr>            <chr>              <chr>               <chr>
## 1          1 Sunil Huffmann   Strawberry yoghurt Lunch only          4    
## 2          2 Barclay Lynn     French fries       Lunch only          5    
## 3          3 Jayendra Lyne    N/A                Breakfast and lunch 7    
## 4          4 Leon Rossini     Anchovies          Lunch only          <NA> 
## 5          5 Chidiegwu Dunkel Pizza              Breakfast and lunch five 
## 6          6 Güvenç Attila    Ice cream          Lunch only          6

. . .

If we have a lot to rename and that gets annoying, see janitor::clean_names().
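A sketch of that alternative (assuming the janitor package is installed; students is the tibble read in above):

students %>%
    janitor::clean_names()
# e.g. `Student ID` becomes student_id and `Full Name` becomes full_name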

Remember dplyr?

Common structure of dplyr verbs

All dplyr functions (verbs) share a common structure:

  • 1st argument is always a data frame
  • Subsequent arguments typically describe which columns to operate on (via their names)
  • Output is always a new data frame

Some dplyr verbs operate on

  • rows - filter(), arrange(), distinct()
  • columns - mutate(), select(), rename()
  • groups - group_by(), summarise(), ungroup()
  • tables - more on this next week…

We can easily combine dplyr functions to solve complex problems by piping together the various outputs as inputs.
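As a quick preview of that idea (a sketch using the english data from the slides below; this particular combination of verbs is just for illustration):

english %>%
    filter(Familiarity < 3.0) %>%       # rows: keep low-familiarity words
    group_by(AgeSubject) %>%            # groups: split by age group
    summarise(
        n = n(),                        # number of rows per group
        mean_RT = mean(RTlexdec)        # mean lexical decision RT per group
    )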

Rows

Manipulate rows with dplyr

filter() keep only some rows based on values of column
arrange() arrange rows in order you specify
distinct() finds unique rows

filter()

Keep only some rows, based on the values of a column, using logical operations

english %>%  
    select(RTlexdec:Word) %>%
    filter(Familiarity < 3.0) %>%
    head(10) # this still returns many rows, so I'm just taking the first 10 for display here
##    RTlexdec RTnaming Familiarity  Word
## 1  6.543754 6.145044        2.37   doe
## 13 6.633582 6.197869        1.72 sprig
## 15 6.472130 6.225944        2.23 hooch
## 22 6.545148 6.159729        1.97  pith
## 23 6.655440 6.214008        1.44 whorl
## 26 6.509648 6.151668        2.83 blitz
## 28 6.566293 6.190110        1.93  prow
## 38 6.542760 6.216606        2.03 spire
## 45 6.775138 6.230875        1.31 lisle
## 49 6.610979 6.159307        1.43  yore

A common mistake is using =

english %>%  
    select(RTlexdec:Word) %>%
    filter(Word = 'zoo')
## Error in `filter()`:
## ! We detected a named input.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `Word == "zoo"`?

instead of ==

english %>%  
    select(RTlexdec:Word) %>%
     filter(Word == 'zoo')
##      RTlexdec RTnaming Familiarity Word
## 1144 6.263855 6.169611         3.3  zoo
## 2596 6.782079 6.555926         3.3  zoo

arrange()

Arranges rows in the order you specify

english %>%  
    select(RTlexdec:Word) %>%
    arrange(Word) %>% # same issue as with filter above... just displaying the top 10 for reference...
    head(10)
##      RTlexdec RTnaming Familiarity Word
## 338  6.435525 6.123150        3.80  ace
## 1790 6.653727 6.409846        3.80  ace
## 3125 6.425031 6.099870        4.87  act
## 3957 6.573010 6.460999        4.87  act
## 3313 6.355587 6.148041        5.27  add
## 4145 6.609605 6.405889        5.27  add
## 337  6.384216 6.140100        5.53  age
## 1789 6.617898 6.407540        5.53  age
## 3513 6.294657 6.142467        4.37  aid
## 4345 6.715092 6.426327        4.37  aid

arrange() uses ascending order by default

english %>%  
    select(RTlexdec:Word) %>%
    arrange(Word) %>%
    head(10)
##      RTlexdec RTnaming Familiarity Word
## 338  6.435525 6.123150        3.80  ace
## 1790 6.653727 6.409846        3.80  ace
## 3125 6.425031 6.099870        4.87  act
## 3957 6.573010 6.460999        4.87  act
## 3313 6.355587 6.148041        5.27  add
## 4145 6.609605 6.405889        5.27  add
## 337  6.384216 6.140100        5.53  age
## 1789 6.617898 6.407540        5.53  age
## 3513 6.294657 6.142467        4.37  aid
## 4345 6.715092 6.426327        4.37  aid

Unless you specify descending

english %>%  
    select(RTlexdec:Word) %>%
    arrange(desc(Word)) %>%
    head(10)
##      RTlexdec RTnaming Familiarity Word
## 1144 6.263855 6.169611        3.30  zoo
## 2596 6.782079 6.555926        3.30  zoo
## 66   6.379037 6.145687        4.53 zone
## 1518 6.582441 6.486618        4.53 zone
## 398  6.390526 6.149109        4.43  zip
## 1850 6.609996 6.519147        4.43  zip
## 796  6.649075 6.222973        2.17 zing
## 2248 6.861847 6.641574        2.17 zing
## 787  6.537358 6.240276        3.73 zinc
## 2239 6.789636 6.528543        3.73 zinc

distinct()

Finds unique rows in a dataset; with no arguments, it removes duplicate rows (if any exist)

english %>%  
    select(RTlexdec:Word) %>%
    distinct(Word) %>%
    head(10)
##      Word
## 1     doe
## 2   whore
## 3  stress
## 4    pork
## 5    plug
## 6    prop
## 7    dawn
## 8     dog
## 9     arc
## 10  skirt

You can optionally specify columns to find unique combinations

Returns only the columns you specify

english %>%  
    select(RTlexdec:Word) %>%
    distinct(Word) %>%
    head(10)
##      Word
## 1     doe
## 2   whore
## 3  stress
## 4    pork
## 5    plug
## 6    prop
## 7    dawn
## 8     dog
## 9     arc
## 10  skirt

Unless you add the .keep_all = TRUE argument

english %>%  
    select(RTlexdec:Word) %>%
    distinct(Word, .keep_all=TRUE) %>%
    head(10)
##    RTlexdec RTnaming Familiarity   Word
## 1  6.543754 6.145044        2.37    doe
## 2  6.397596 6.246882        4.43  whore
## 3  6.304942 6.143756        5.60 stress
## 4  6.424221 6.131878        3.87   pork
## 5  6.450597 6.198479        3.93   plug
## 6  6.531970 6.167726        3.27   prop
## 7  6.370586 6.123808        3.73   dawn
## 8  6.266859 6.096050        5.67    dog
## 9  6.608648 6.117657        3.10    arc
## 10 6.284843 6.179188        4.43  skirt

Columns

Manipulate columns with dplyr

mutate() adds new columns calculated from existing columns
select() selects columns based on their names
rename() renames some columns

select()

Select columns based on their names

english %>%  
    select(RTlexdec, Familiarity, AgeSubject) %>%
    head(10)
##    RTlexdec Familiarity AgeSubject
## 1  6.543754        2.37      young
## 2  6.397596        4.43      young
## 3  6.304942        5.60      young
## 4  6.424221        3.87      young
## 5  6.450597        3.93      young
## 6  6.531970        3.27      young
## 7  6.370586        3.73      young
## 8  6.266859        5.67      young
## 9  6.608648        3.10      young
## 10 6.284843        4.43      young

Use : to select everything from one column to another

english %>%  
    select(RTlexdec:AgeSubject) %>%
    head(10)
##    RTlexdec RTnaming Familiarity   Word AgeSubject
## 1  6.543754 6.145044        2.37    doe      young
## 2  6.397596 6.246882        4.43  whore      young
## 3  6.304942 6.143756        5.60 stress      young
## 4  6.424221 6.131878        3.87   pork      young
## 5  6.450597 6.198479        3.93   plug      young
## 6  6.531970 6.167726        3.27   prop      young
## 7  6.370586 6.123808        3.73   dawn      young
## 8  6.266859 6.096050        5.67    dog      young
## 9  6.608648 6.117657        3.10    arc      young
## 10 6.284843 6.179188        4.43  skirt      young

Use logical operators like & or ! to identify the subset of columns you want to select

english %>%  
    select(!RTlexdec:AgeSubject) %>%
    head(10)
##    WordCategory WrittenFrequency WrittenSpokenFrequencyRatio FamilySize
## 1             N         3.912023                  1.02165125   1.386294
## 2             N         4.521789                  0.35048297   1.386294
## 3             N         6.505784                  2.08935600   1.609438
## 4             N         5.017280                 -0.52633389   1.945910
## 5             N         4.890349                 -1.04454507   2.197225
## 6             N         4.770685                  0.92480142   1.386294
## 7             N         6.383507                 -0.01542102   1.098612
## 8             N         7.159292                 -1.04821897   3.688879
## 9             N         4.890349                  2.91626810   1.609438
## 10            N         5.929589                 -0.07873252   1.791759
##    DerivationalEntropy InflectionalEntropy NumberSimplexSynsets
## 1              0.14144             0.02114            0.6931472
## 2              0.42706             0.94198            1.0986123
## 3              0.06197             1.44339            2.4849066
## 4              0.43035             0.00000            1.0986123
## 5              0.35920             1.75393            2.4849066
## 6              0.06268             1.74730            1.6094379
## 7              0.00000             0.77898            1.7917595
## 8              1.03052             1.05129            2.0794415
## 9              0.15894             0.57715            1.6094379
## 10             0.06664             1.34797            1.9459101
##    NumberComplexSynsets LengthInLetters Ncount MeanBigramFrequency
## 1              0.000000               3      8            7.036333
## 2              0.000000               5      5            9.537878
## 3              1.945910               6      0            9.883931
## 4              2.639057               4      8            8.309180
## 5              2.484907               4      3            7.943717
## 6              1.386294               4      9            8.349620
## 7              1.098612               4      6            7.699792
## 8              4.465908               3     13            7.139341
## 9              2.079442               3      3            7.747861
## 10             2.079442               5      3            8.561149
##    FrequencyInitialDiphone ConspelV ConspelN ConphonV ConphonN ConfriendsV
## 1                 12.02268       10 3.737670       41 8.837826           8
## 2                 12.59780       20 7.870930       38 9.775825          20
## 3                 13.30069       10 6.693324       13 7.040536          10
## 4                 12.07807        5 6.677083        6 3.828641           4
## 5                 11.92678       17 4.762174       17 4.762174          17
## 6                 12.19724       19 6.234411       21 6.249975          19
## 7                 11.32815       10 4.779123       13 8.864323          10
## 8                 12.02268       13 4.795791        7 4.770685           6
## 9                 13.29079        1 3.737670       11 6.091310           0
## 10                10.38890        7 4.624973       14 5.164786           7
##    ConfriendsN    ConffV   ConffN   ConfbV   ConfbN NounFrequency VerbFrequency
## 1     3.295837 0.6931472 2.708050 3.496508 8.833900            49             0
## 2     7.870930 0.0000000 0.000000 2.944439 9.614738           142             0
## 3     6.693324 0.0000000 0.000000 1.386294 5.817111           565           473
## 4     3.526361 0.6931472 6.634633 1.098612 2.564949           150             0
## 5     4.762174 0.0000000 0.000000 0.000000 0.000000           170           120
## 6     6.234411 0.0000000 0.000000 1.098612 2.197225           125           280
## 7     4.779123 0.0000000 0.000000 1.386294 8.847504           582           110
## 8     3.761200 1.9459101 1.386294 0.000000 0.000000          2061            76
## 9     0.000000 0.0000000 0.000000 2.397895 5.993961           144             4
## 10    4.624973 0.0000000 0.000000 2.079442 4.304065           522            86
##    CV Obstruent Frication     Voice FrequencyInitialDiphoneWord
## 1   C      obst     burst    voiced                   10.129308
## 2   C      obst frication voiceless                    9.054388
## 3   C      obst frication voiceless                   12.422026
## 4   C      obst     burst voiceless                   10.048151
## 5   C      obst     burst voiceless                   11.796336
## 6   C      obst     burst voiceless                   11.991567
## 7   C      obst     burst    voiced                    9.408125
## 8   C      obst     burst    voiced                    9.755336
## 9   V      cont      long    voiced                    6.424869
## 10  C      obst frication voiceless                   11.129011
##    FrequencyInitialDiphoneSyllable CorrectLexdec
## 1                        10.409763            27
## 2                         9.148252            30
## 3                        13.127395            30
## 4                        11.003649            30
## 5                        12.163092            26
## 6                        12.436772            28
## 7                         9.772410            30
## 8                        10.255024            28
## 9                         6.538140            25
## 10                       11.578563            29

You can rename columns when you select them

english %>%  
    select(RTlex = RTlexdec, Familiarity, AgeSubject) %>%
    head(10)
##       RTlex Familiarity AgeSubject
## 1  6.543754        2.37      young
## 2  6.397596        4.43      young
## 3  6.304942        5.60      young
## 4  6.424221        3.87      young
## 5  6.450597        3.93      young
## 6  6.531970        3.27      young
## 7  6.370586        3.73      young
## 8  6.266859        5.67      young
## 9  6.608648        3.10      young
## 10 6.284843        4.43      young

rename()

Keep all columns but rename one or more

english %>%  
    rename(RTlex = RTlexdec, WordCat = WordCategory) %>%
    head(10)
##       RTlex RTnaming Familiarity   Word AgeSubject WordCat WrittenFrequency
## 1  6.543754 6.145044        2.37    doe      young       N         3.912023
## 2  6.397596 6.246882        4.43  whore      young       N         4.521789
## 3  6.304942 6.143756        5.60 stress      young       N         6.505784
## 4  6.424221 6.131878        3.87   pork      young       N         5.017280
## 5  6.450597 6.198479        3.93   plug      young       N         4.890349
## 6  6.531970 6.167726        3.27   prop      young       N         4.770685
## 7  6.370586 6.123808        3.73   dawn      young       N         6.383507
## 8  6.266859 6.096050        5.67    dog      young       N         7.159292
## 9  6.608648 6.117657        3.10    arc      young       N         4.890349
## 10 6.284843 6.179188        4.43  skirt      young       N         5.929589
##    WrittenSpokenFrequencyRatio FamilySize DerivationalEntropy
## 1                   1.02165125   1.386294             0.14144
## 2                   0.35048297   1.386294             0.42706
## 3                   2.08935600   1.609438             0.06197
## 4                  -0.52633389   1.945910             0.43035
## 5                  -1.04454507   2.197225             0.35920
## 6                   0.92480142   1.386294             0.06268
## 7                  -0.01542102   1.098612             0.00000
## 8                  -1.04821897   3.688879             1.03052
## 9                   2.91626810   1.609438             0.15894
## 10                 -0.07873252   1.791759             0.06664
##    InflectionalEntropy NumberSimplexSynsets NumberComplexSynsets
## 1              0.02114            0.6931472             0.000000
## 2              0.94198            1.0986123             0.000000
## 3              1.44339            2.4849066             1.945910
## 4              0.00000            1.0986123             2.639057
## 5              1.75393            2.4849066             2.484907
## 6              1.74730            1.6094379             1.386294
## 7              0.77898            1.7917595             1.098612
## 8              1.05129            2.0794415             4.465908
## 9              0.57715            1.6094379             2.079442
## 10             1.34797            1.9459101             2.079442
##    LengthInLetters Ncount MeanBigramFrequency FrequencyInitialDiphone ConspelV
## 1                3      8            7.036333                12.02268       10
## 2                5      5            9.537878                12.59780       20
## 3                6      0            9.883931                13.30069       10
## 4                4      8            8.309180                12.07807        5
## 5                4      3            7.943717                11.92678       17
## 6                4      9            8.349620                12.19724       19
## 7                4      6            7.699792                11.32815       10
## 8                3     13            7.139341                12.02268       13
## 9                3      3            7.747861                13.29079        1
## 10               5      3            8.561149                10.38890        7
##    ConspelN ConphonV ConphonN ConfriendsV ConfriendsN    ConffV   ConffN
## 1  3.737670       41 8.837826           8    3.295837 0.6931472 2.708050
## 2  7.870930       38 9.775825          20    7.870930 0.0000000 0.000000
## 3  6.693324       13 7.040536          10    6.693324 0.0000000 0.000000
## 4  6.677083        6 3.828641           4    3.526361 0.6931472 6.634633
## 5  4.762174       17 4.762174          17    4.762174 0.0000000 0.000000
## 6  6.234411       21 6.249975          19    6.234411 0.0000000 0.000000
## 7  4.779123       13 8.864323          10    4.779123 0.0000000 0.000000
## 8  4.795791        7 4.770685           6    3.761200 1.9459101 1.386294
## 9  3.737670       11 6.091310           0    0.000000 0.0000000 0.000000
## 10 4.624973       14 5.164786           7    4.624973 0.0000000 0.000000
##      ConfbV   ConfbN NounFrequency VerbFrequency CV Obstruent Frication
## 1  3.496508 8.833900            49             0  C      obst     burst
## 2  2.944439 9.614738           142             0  C      obst frication
## 3  1.386294 5.817111           565           473  C      obst frication
## 4  1.098612 2.564949           150             0  C      obst     burst
## 5  0.000000 0.000000           170           120  C      obst     burst
## 6  1.098612 2.197225           125           280  C      obst     burst
## 7  1.386294 8.847504           582           110  C      obst     burst
## 8  0.000000 0.000000          2061            76  C      obst     burst
## 9  2.397895 5.993961           144             4  V      cont      long
## 10 2.079442 4.304065           522            86  C      obst frication
##        Voice FrequencyInitialDiphoneWord FrequencyInitialDiphoneSyllable
## 1     voiced                   10.129308                       10.409763
## 2  voiceless                    9.054388                        9.148252
## 3  voiceless                   12.422026                       13.127395
## 4  voiceless                   10.048151                       11.003649
## 5  voiceless                   11.796336                       12.163092
## 6  voiceless                   11.991567                       12.436772
## 7     voiced                    9.408125                        9.772410
## 8     voiced                    9.755336                       10.255024
## 9     voiced                    6.424869                        6.538140
## 10 voiceless                   11.129011                       11.578563
##    CorrectLexdec
## 1             27
## 2             30
## 3             30
## 4             30
## 5             26
## 6             28
## 7             30
## 8             28
## 9             25
## 10            29

mutate()

Add new columns that are calculated from existing columns

english %>%  
    select(RTlexdec:AgeSubject) %>%
    mutate(RTdiff = RTlexdec - RTnaming) %>%
    head(10)
##    RTlexdec RTnaming Familiarity   Word AgeSubject    RTdiff
## 1  6.543754 6.145044        2.37    doe      young 0.3987099
## 2  6.397596 6.246882        4.43  whore      young 0.1507144
## 3  6.304942 6.143756        5.60 stress      young 0.1611859
## 4  6.424221 6.131878        3.87   pork      young 0.2923421
## 5  6.450597 6.198479        3.93   plug      young 0.2521181
## 6  6.531970 6.167726        3.27   prop      young 0.3642442
## 7  6.370586 6.123808        3.73   dawn      young 0.2467779
## 8  6.266859 6.096050        5.67    dog      young 0.1708092
## 9  6.608648 6.117657        3.10    arc      young 0.4909916
## 10 6.284843 6.179188        4.43  skirt      young 0.1056547

Columns are added on the right by default, but you can specify where to place them (by position or name) using .before and .after

With .after

english %>%  
    select(RTlexdec:Familiarity) %>%
    mutate(
        RTdiff = RTlexdec - RTnaming,
        .after = RTnaming
    ) %>%
    head(10)
##    RTlexdec RTnaming    RTdiff Familiarity
## 1  6.543754 6.145044 0.3987099        2.37
## 2  6.397596 6.246882 0.1507144        4.43
## 3  6.304942 6.143756 0.1611859        5.60
## 4  6.424221 6.131878 0.2923421        3.87
## 5  6.450597 6.198479 0.2521181        3.93
## 6  6.531970 6.167726 0.3642442        3.27
## 7  6.370586 6.123808 0.2467779        3.73
## 8  6.266859 6.096050 0.1708092        5.67
## 9  6.608648 6.117657 0.4909916        3.10
## 10 6.284843 6.179188 0.1056547        4.43

With .before

english %>%  
    select(RTlexdec:Familiarity) %>%
    mutate(
        RTdiff = RTlexdec - RTnaming,
        .before = RTlexdec
    ) %>%
    head(10)
##       RTdiff RTlexdec RTnaming Familiarity
## 1  0.3987099 6.543754 6.145044        2.37
## 2  0.1507144 6.397596 6.246882        4.43
## 3  0.1611859 6.304942 6.143756        5.60
## 4  0.2923421 6.424221 6.131878        3.87
## 5  0.2521181 6.450597 6.198479        3.93
## 6  0.3642442 6.531970 6.167726        3.27
## 7  0.2467779 6.370586 6.123808        3.73
## 8  0.1708092 6.266859 6.096050        5.67
## 9  0.4909916 6.608648 6.117657        3.10
## 10 0.1056547 6.284843 6.179188        4.43

Group and summarise

Group and summarise with dplyr

group_by() used to divide your dataset into groups
summarise() often used after group_by() to calculate summary statistics on grouped data
ungroup() used to remove the grouping

group_by()

Divide your dataset into groups

english %>%  
    select(RTlexdec, Familiarity, AgeSubject) %>%
    group_by(AgeSubject)
## # A tibble: 4,568 × 3
## # Groups:   AgeSubject [2]
##    RTlexdec Familiarity AgeSubject
##       <dbl>       <dbl> <fct>     
##  1     6.54        2.37 young     
##  2     6.40        4.43 young     
##  3     6.30        5.6  young     
##  4     6.42        3.87 young     
##  5     6.45        3.93 young     
##  6     6.53        3.27 young     
##  7     6.37        3.73 young     
##  8     6.27        5.67 young     
##  9     6.61        3.1  young     
## 10     6.28        4.43 young     
## # ℹ 4,558 more rows

Can group by more than one variable

english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    group_by(AgeSubject, Voice)
## # A tibble: 4,568 × 4
## # Groups:   AgeSubject, Voice [4]
##    RTlexdec Familiarity AgeSubject Voice    
##       <dbl>       <dbl> <fct>      <fct>    
##  1     6.54        2.37 young      voiced   
##  2     6.40        4.43 young      voiceless
##  3     6.30        5.6  young      voiceless
##  4     6.42        3.87 young      voiceless
##  5     6.45        3.93 young      voiceless
##  6     6.53        3.27 young      voiceless
##  7     6.37        3.73 young      voiced   
##  8     6.27        5.67 young      voiced   
##  9     6.61        3.1  young      voiced   
## 10     6.28        4.43 young      voiceless
## # ℹ 4,558 more rows

group_by() does not change the original data frame; it just adds a groups attribute

grouped_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject) %>%
    group_by(AgeSubject)

grouped_english
## # A tibble: 4,568 × 3
## # Groups:   AgeSubject [2]
##    RTlexdec Familiarity AgeSubject
##       <dbl>       <dbl> <fct>     
##  1     6.54        2.37 young     
##  2     6.40        4.43 young     
##  3     6.30        5.6  young     
##  4     6.42        3.87 young     
##  5     6.45        3.93 young     
##  6     6.53        3.27 young     
##  7     6.37        3.73 young     
##  8     6.27        5.67 young     
##  9     6.61        3.1  young     
## 10     6.28        4.43 young     
## # ℹ 4,558 more rows
attr(english, "groups")
## NULL
attr(grouped_english, "groups")
## # A tibble: 2 × 2
##   AgeSubject       .rows
##   <fct>      <list<int>>
## 1 old            [2,284]
## 2 young          [2,284]

. . .

summarise()

Often used after group_by() to calculate summary stats on grouped data

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject) %>%
    group_by(AgeSubject) %>%
    summarise(n = n())

summary_english
## # A tibble: 2 × 2
##   AgeSubject     n
##   <fct>      <int>
## 1 old         2284
## 2 young       2284

You can add any number of summary stats

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    group_by(AgeSubject, Voice) %>%
    summarise(
        n = n(), 
        mean = mean(RTlexdec)
    )
## `summarise()` has grouped output by 'AgeSubject'. You can override using the
## `.groups` argument.
summary_english
## # A tibble: 4 × 4
## # Groups:   AgeSubject [2]
##   AgeSubject Voice         n  mean
##   <fct>      <fct>     <int> <dbl>
## 1 old        voiced     1030  6.67
## 2 old        voiceless  1254  6.66
## 3 young      voiced     1030  6.44
## 4 young      voiceless  1254  6.43
  • Note that the returned dataframe is grouped!

Use the .groups argument to drop or keep grouping in returned dataframe

Drop all groups

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    group_by(AgeSubject, Voice) %>%
    summarise(
        n = n(), 
        mean = mean(RTlexdec), 
        .groups = "drop"
    )

summary_english
## # A tibble: 4 × 4
##   AgeSubject Voice         n  mean
##   <fct>      <fct>     <int> <dbl>
## 1 old        voiced     1030  6.67
## 2 old        voiceless  1254  6.66
## 3 young      voiced     1030  6.44
## 4 young      voiceless  1254  6.43

Keep all groups

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    group_by(AgeSubject, Voice) %>%
    summarise(
        n = n(), 
        mean = mean(RTlexdec), 
        .groups = "keep"
    )

summary_english
## # A tibble: 4 × 4
## # Groups:   AgeSubject, Voice [4]
##   AgeSubject Voice         n  mean
##   <fct>      <fct>     <int> <dbl>
## 1 old        voiced     1030  6.67
## 2 old        voiceless  1254  6.66
## 3 young      voiced     1030  6.44
## 4 young      voiceless  1254  6.43

Or use the new .by argument instead of group_by() to return an ungrouped dataframe

Drop all groups

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    group_by(AgeSubject, Voice) %>%
    summarise(
        n = n(), 
        mean = mean(RTlexdec), 
        .groups = "drop"
    )

summary_english
## # A tibble: 4 × 4
##   AgeSubject Voice         n  mean
##   <fct>      <fct>     <int> <dbl>
## 1 old        voiced     1030  6.67
## 2 old        voiceless  1254  6.66
## 3 young      voiced     1030  6.44
## 4 young      voiceless  1254  6.43

Or just use .by!

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    summarise(
        n = n(), 
        mean = mean(RTlexdec), 
        .by = c(AgeSubject, Voice)
    )

summary_english
##   AgeSubject     Voice    n     mean
## 1      young    voiced 1030 6.444450
## 2      young voiceless 1254 6.434955
## 3        old    voiced 1030 6.666049
## 4        old voiceless 1254 6.656777

ungroup()

Can also remove grouping after with ungroup()

Drop all groups

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    group_by(AgeSubject, Voice) %>%
    summarise(
        n = n(), 
        mean = mean(RTlexdec), 
        .groups = "drop"
    )

summary_english
## # A tibble: 4 × 4
##   AgeSubject Voice         n  mean
##   <fct>      <fct>     <int> <dbl>
## 1 old        voiced     1030  6.67
## 2 old        voiceless  1254  6.66
## 3 young      voiced     1030  6.44
## 4 young      voiceless  1254  6.43

Or ungroup afterwards with ungroup() (a harmless no-op here, since .by already returns an ungrouped result)

summary_english <- english %>%  
    select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
    summarise(
        n = n(), 
        mean = mean(RTlexdec), 
        .by = c(AgeSubject, Voice)
    ) %>% 
    ungroup()

summary_english
##   AgeSubject     Voice    n     mean
## 1      young    voiced 1030 6.444450
## 2      young voiceless 1254 6.434955
## 3        old    voiced 1030 6.666049
## 4        old voiceless 1254 6.656777

We’ll pick up with joining data frames and pivots (longer vs. wider) next time!