We ended last time before covering how Facets work so let’s do that now.
Facets split a plot into smaller plots that display different subsets of the data.
facet_grid()
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point() +
facet_grid(Class ~ Complex) +
theme_classic(base_size = 20)
Compare with the same data, viewed with two aesthetics (color and shape)
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
aes(color = Class, shape = Complex)
) +
theme_classic(base_size = 20)
facet_grid() - just columns
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point() +
facet_grid(. ~ Complex) +
theme_classic(base_size = 20)
facet_grid() - just columns
Note that we can still map other aesthetics!
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
aes(color = Class),
shape = "triangle"
) +
facet_grid(. ~ Complex) +
theme_classic(base_size = 20)
facet_grid() - just rows
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point() +
facet_grid(Class ~ .) +
theme_classic(base_size = 20)
facet_wrap()
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point() +
facet_wrap(~ Class) +
theme_classic(base_size = 20)
facet_wrap() - number of columns
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point() +
facet_wrap(~ Class, ncol = 1) +
theme_classic(base_size = 20)
Now we have the pieces to make it ourselves.
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point(
mapping = aes(color = Class),
size = 3
) +
labs(
title = "Subjective frequency ratings",
subtitle = "for 81 English nouns",
x = "Actual frequency",
y = "Frequency rating",
color = "word class"
) +
theme_classic(base_size = 20) +
scale_color_brewer(palette = "Paired")
last_plot() returns the last plot
last_plot()
ggsave() saves the last plot
ggsave("plot.png", width=5, height=5)
ggplot comes with many Complete themes
last_plot() + theme_gray(base_size=20)
last_plot() + theme_bw(base_size=20)
last_plot() + theme_classic(base_size=20)
last_plot() + theme_minimal(base_size=20)
last_plot() + theme_void(base_size=20)
I (Spencer) have found that I keep returning to the same sorts of visualization over and over across different research projects, and rather than repeat code, it becomes useful to define my own theme and apply that (or variants of it) when I need it. You’ll see an example of that on HW2.
Explicit argument names:
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point()
Implied argument names:
ggplot(
ratings,
aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point()
%>%
The pipe takes the thing on its left and passes it along to the function on its right.
x %>% f(y) is equivalent to f(x, y)
library(magrittr)
##
## Attaching package: 'magrittr'
## The following object is masked from 'package:purrr':
##
## set_names
## The following object is masked from 'package:tidyr':
##
## extract
x <- c(1.0, 2.245, 3, 4.22222)
x
## [1] 1.00000 2.24500 3.00000 4.22222
# pass x as an argument to function usual way
round(x, digits = 2)
## [1] 1.00 2.24 3.00 4.22
# pass x as an argument to function with pipe
x %>% round(digits = 2)
## [1] 1.00 2.24 3.00 4.22
There are two ways to write the pipe: %>% (from magrittr) or |> (the base R pipe, available since R 4.1)
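Either spelling drops straight into an ordinary function call; a quick base-R check (no packages needed; |> requires R ≥ 4.1) that the piped form agrees with the plain call:

```r
x <- c(1.0, 2.245, 3, 4.22222)

# base R's native pipe: x |> f(args) rewrites to f(x, args)
piped <- x |> round(digits = 2)

# the ordinary call it stands for
direct <- round(x, digits = 2)

identical(piped, direct)  # TRUE
```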
%>% and ggplot
Implied argument names:
ggplot(
data = ratings,
mapping = aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point()
Implied argument names + pipe:
ratings %>%
ggplot(
aes(
x = Frequency,
y = meanFamiliarity
)
) +
geom_point()
Note that we pipe (%>%) arguments into functions, but we ADD (+) layers to a ggplot. Mixing these up is a common mistake!
The tidyverse is an opinionated collection of R packages designed for data science. All packages share an underlying design philosophy, grammar, and data structures.
ggplot2 - for data visualization
dplyr - for data wrangling
readr - for reading data
tibble - for modern data frames
stringr - for string manipulation
forcats - for dealing with factors
tidyr - for data tidying
purrr - for functional programming
You just need to run / include library(tidyverse) at the top of your script.
Tidyverse makes use of tidy data, a standard way of structuring datasets:
Why tidy data?
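As a toy illustration (invented numbers, base R only): the same measurements stored in an untidy layout, with one column per year, versus a tidy layout, with one row per observation:

```r
# untidy: one variable (population) spread across two year columns
untidy <- data.frame(
  country = c("A", "B"),
  `2019`  = c(10, 20),
  `2020`  = c(11, 21),
  check.names = FALSE  # keep the numeric column names as-is
)

# tidy: every column is a variable, every row is an observation
tidy <- data.frame(
  country    = c("A", "A", "B", "B"),
  year       = c(2019, 2020, 2019, 2020),
  population = c(10, 11, 20, 21)
)
```

The tidy layout is the shape that ggplot2 aesthetics and dplyr verbs expect: one column per variable you might map or group by.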
purrr
Functional programming, to illustrate the joy of tidyverse and tidy data.
purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors. If you’ve never heard of FP before, the best place to start is the family of map() functions which allow you to replace many for loops with code that is both more succinct and easier to read.
map_*() functions
We say “functions” because there are 5, one for each type of vector:
map() - list
map_lgl() - logical
map_int() - integer
map_dbl() - double
map_chr() - character
map use case
df <- data.frame(
x = 1:10,
y = 11:20,
z = 21:30
)
with copy+paste
mean(df$x)
## [1] 5.5
mean(df$y)
## [1] 15.5
mean(df$z)
## [1] 25.5
with map
map(df, mean)
## $x
## [1] 5.5
##
## $y
## [1] 15.5
##
## $z
## [1] 25.5
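For comparison only (not part of the tidyverse): base R's vapply() covers the same use case as map_dbl(), applying a function to each column and promising exactly one double per element, which is handy to know when purrr isn't loaded:

```r
df <- data.frame(
  x = 1:10,
  y = 11:20,
  z = 21:30
)

# base analogue of map_dbl(df, mean): iterate over the columns,
# type-checking that each result is a single double
means <- vapply(df, mean, numeric(1))
means
#   x    y    z
# 5.5 15.5 25.5
```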
tibble
Modern data frames
A tibble, or tbl_df, is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not. Tibbles are data.frames that are lazy and surly: they do less and complain more.
Tibbles do less than data frames, in a good way:
The take-away is that data.frame and tibble sometimes behave differently. The behavior of tibble makes more sense for modern data science, so we should use it instead!
tibble
Coerce an existing object:
df <- data.frame(
x = 1:4,
y = c("a", "b", "c", "d")
)
as_tibble(df)
## # A tibble: 4 × 2
## x y
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
Pass a column of vectors:
tibble(
x = 1:4,
y = c("a", "b", "c", "d")
)
## # A tibble: 4 × 2
## x y
## <int> <chr>
## 1 1 a
## 2 2 b
## 3 3 c
## 4 4 d
Define row-by-row with tribble():
tribble(
~x, ~y,
"a", 1,
"b", 2,
"c", 3,
"d", 4
)
## # A tibble: 4 × 2
## x y
## <chr> <dbl>
## 1 a 1
## 2 b 2
## 3 c 3
## 4 d 4
tibble
Check whether an object is a tibble with is_tibble(x) and is.data.frame(x)
Data frame:
df <- data.frame(
x = 1:4,
y = c("a", "b", "c", "d")
)
is_tibble(df)
## [1] FALSE
is.data.frame(df)
## [1] TRUE
Tibble:
tib <- tribble(
~x, ~y,
"a", 1,
"b", 2,
"c", 3,
"d", 4
)
is_tibble(tib)
## [1] TRUE
is.data.frame(tib)
## [1] TRUE
data.frame v tibble
You will encounter 2 main differences:
printing - tibbles print only the first rows and show each column’s type (e.g. <dbl>, <chr>)
subsetting - extracting with [[ and $ is stricter for tibbles (no partial matching of column names)
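One of those $ differences can be demonstrated with base R alone: a data.frame silently partial-matches column names, while a tibble would return NULL (with a warning) for the same access. A minimal sketch of the data.frame half:

```r
df <- data.frame(value = 1:3)

# $ on a data.frame partially matches column names: "val" is a
# prefix of "value", so this silently returns the value column
df$val
# [1] 1 2 3
```

With a tibble, as_tibble(df)$val would warn "Unknown or uninitialised column" and return NULL, surfacing the typo instead of hiding it.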
readr
The goal of readr is to provide a fast and friendly way to read rectangular data from delimited files, such as comma-separated values (CSV) and tab-separated values (TSV). It is designed to parse many types of data found in the wild, while providing an informative problem report when parsing leads to unexpected results.
read_*()
The read_*() functions have two important arguments:
file - the path to the file
col_types - a list of how each column should be converted to a specific data type
read_*()
read_csv(): comma-separated values (CSV)
read_tsv(): tab-separated values (TSV)
read_csv2(): semicolon-separated values
read_delim(): delimited files (CSV and TSV are important special cases)
read_fwf(): fixed-width files
read_table(): whitespace-separated files
read_log(): web log files
csv files
Path only, readr guesses types:
read_csv(file = "https://pos.it/r4ds-students-csv")
Path and specify col_types:
read_csv(
file = "https://pos.it/r4ds-students-csv",
col_types = list(x = col_character(), y = col_skip())
)
Guessing heuristic: character > date-time > double > logical
col_types column specification
There are 11 column types that can be specified:
col_logical() - reads as boolean TRUE FALSE values
col_integer() - reads as integer
col_double() - reads as double
col_number() - numeric parser that can ignore non-numbers
col_character() - reads as strings
col_factor(levels, ordered = FALSE) - creates factors
col_datetime(format = "") - creates date-times
col_date(format = "") - creates dates
col_time(format = "") - creates times
col_skip() - skips a column
col_guess() - tries to guess the column
Reading more complex file types requires functions outside the tidyverse:
readxl - see Spreadsheets in R for Data Science
googlesheets4 - see Spreadsheets in R for Data Science
DBI - see Databases in R for Data Science
jsonlite - see Hierarchical data in R for Data Science
Write to a .csv file with
write_csv(students, "students.csv")
arguments: tibble, name to give file
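A round-trip sketch using the base R analogues write.csv()/read.csv() (toy data frame and a temporary file, so nothing beyond base R is needed; row.names = FALSE keeps the file in the same shape write_csv() produces):

```r
students_toy <- data.frame(
  id   = 1:3,
  name = c("a", "b", "c")
)

path <- tempfile(fileext = ".csv")
write.csv(students_toy, path, row.names = FALSE)  # base analogue of write_csv()
back <- read.csv(path)                            # base analogue of read_csv()

identical(back$id, students_toy$id)  # TRUE
```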
readr
students <- read_csv("https://pos.it/r4ds-students-csv")
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Full Name, favourite.food, mealPlan, AGE
## dbl (1): Student ID
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
students
## # A tibble: 6 × 5
## `Student ID` `Full Name` favourite.food mealPlan AGE
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
## 2 2 Barclay Lynn French fries Lunch only 5
## 3 3 Jayendra Lyne N/A Breakfast and lunch 7
## 4 4 Leon Rossini Anchovies Lunch only <NA>
## 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
## 6 6 Güvenç Attila Ice cream Lunch only 6
Problems to notice: a column read as the wrong type (AGE), missing values not coded as NA (AGE and favourite.food), and column names that break R’s naming rules (Student ID and Full Name).
Your dataset has a column that you expected to be logical or double, but there is a typo somewhere, so R has coerced the column into character.
students
## # A tibble: 6 × 5
## `Student ID` `Full Name` favourite.food mealPlan AGE
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
## 2 2 Barclay Lynn French fries Lunch only 5
## 3 3 Jayendra Lyne N/A Breakfast and lunch 7
## 4 4 Leon Rossini Anchovies Lunch only <NA>
## 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
## 6 6 Güvenç Attila Ice cream Lunch only 6
Solve by specifying the column type col_double() and
then using the problems() function to see where R
failed.
students_coerced <- read_csv(
file = 'https://pos.it/r4ds-students-csv',
col_types = list(AGE = col_double()))
## Warning: One or more parsing issues, call `problems()` on your data frame for details,
## e.g.:
## dat <- vroom(...)
## problems(dat)
problems(students_coerced)
## # A tibble: 1 × 5
## row col expected actual file
## <int> <int> <chr> <chr> <chr>
## 1 6 5 a double five ""
NAYour dataset has missing values, but they were not coded as
NA as R expects.
students
## # A tibble: 6 × 5
## `Student ID` `Full Name` favourite.food mealPlan AGE
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
## 2 2 Barclay Lynn French fries Lunch only 5
## 3 3 Jayendra Lyne N/A Breakfast and lunch 7
## 4 4 Leon Rossini Anchovies Lunch only <NA>
## 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
## 6 6 Güvenç Attila Ice cream Lunch only 6
Solve by adding an na argument
(e.g. na=c("N/A"))
(students_nas <- read_csv(
file = 'https://pos.it/r4ds-students-csv',
na = c("N/A", "<NA>")))
## Rows: 6 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): Full Name, favourite.food, mealPlan, AGE
## dbl (1): Student ID
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## # A tibble: 6 × 5
## `Student ID` `Full Name` favourite.food mealPlan AGE
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yoghurt Lunch only "4"
## 2 2 Barclay Lynn French fries Lunch only "5"
## 3 3 Jayendra Lyne <NA> Breakfast and lunch "7"
## 4 4 Leon Rossini Anchovies Lunch only ""
## 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch "five"
## 6 6 Güvenç Attila Ice cream Lunch only "6"
Your dataset has column names that include spaces, breaking R’s naming rules. In these cases, R adds backticks (e.g. `brain size`).
We can use the rename() function to fix them.
students %>%
rename(
student_id = `Student ID`,
full_name = `Full Name`
)
## # A tibble: 6 × 5
## student_id full_name favourite.food mealPlan AGE
## <dbl> <chr> <chr> <chr> <chr>
## 1 1 Sunil Huffmann Strawberry yoghurt Lunch only 4
## 2 2 Barclay Lynn French fries Lunch only 5
## 3 3 Jayendra Lyne N/A Breakfast and lunch 7
## 4 4 Leon Rossini Anchovies Lunch only <NA>
## 5 5 Chidiegwu Dunkel Pizza Breakfast and lunch five
## 6 6 Güvenç Attila Ice cream Lunch only 6
If we have a lot to rename and that gets annoying, see
janitor::clean_names().
dplyr verbs
All dplyr functions (verbs) share a common structure: the first argument is always a data frame, the following arguments typically describe which columns to operate on, and the output is always a new data frame.
dplyr verbs operate on:
rows - filter(), arrange(), distinct()
columns - mutate(), select(), rename()
groups - group_by(), summarise(), ungroup()
We can easily combine dplyr functions to solve complex problems by piping together the various outputs as inputs.
dplyr
filter() - keep only some rows based on values of a column
arrange() - arrange rows in the order you specify
distinct() - find unique rows
filter()
Keep only some rows based on values of a column, using logical operations.
english %>%
select(RTlexdec:Word) %>%
filter(Familiarity < 3.0) %>%
head(10) # the filter still returns many rows, so take the first 10 just for display here
## RTlexdec RTnaming Familiarity Word
## 1 6.543754 6.145044 2.37 doe
## 13 6.633582 6.197869 1.72 sprig
## 15 6.472130 6.225944 2.23 hooch
## 22 6.545148 6.159729 1.97 pith
## 23 6.655440 6.214008 1.44 whorl
## 26 6.509648 6.151668 2.83 blitz
## 28 6.566293 6.190110 1.93 prow
## 38 6.542760 6.216606 2.03 spire
## 45 6.775138 6.230875 1.31 lisle
## 49 6.610979 6.159307 1.43 yore
Common mistake is using =
english %>%
select(RTlexdec:Word) %>%
filter(Word = 'zoo')
## Error in `filter()`:
## ! We detected a named input.
## ℹ This usually means that you've used `=` instead of `==`.
## ℹ Did you mean `Word == "zoo"`?
instead of ==
english %>%
select(RTlexdec:Word) %>%
filter(Word == 'zoo')
## RTlexdec RTnaming Familiarity Word
## 1144 6.263855 6.169611 3.3 zoo
## 2596 6.782079 6.555926 3.3 zoo
arrange()
Arranges rows in the order you specify.
english %>%
select(RTlexdec:Word) %>%
arrange(Word) %>% # as with filter above, display only the first 10 rows
head(10)
## RTlexdec RTnaming Familiarity Word
## 338 6.435525 6.123150 3.80 ace
## 1790 6.653727 6.409846 3.80 ace
## 3125 6.425031 6.099870 4.87 act
## 3957 6.573010 6.460999 4.87 act
## 3313 6.355587 6.148041 5.27 add
## 4145 6.609605 6.405889 5.27 add
## 337 6.384216 6.140100 5.53 age
## 1789 6.617898 6.407540 5.53 age
## 3513 6.294657 6.142467 4.37 aid
## 4345 6.715092 6.426327 4.37 aid
arrange() uses ascending order by default
english %>%
select(RTlexdec:Word) %>%
arrange(Word) %>%
head(10)
## RTlexdec RTnaming Familiarity Word
## 338 6.435525 6.123150 3.80 ace
## 1790 6.653727 6.409846 3.80 ace
## 3125 6.425031 6.099870 4.87 act
## 3957 6.573010 6.460999 4.87 act
## 3313 6.355587 6.148041 5.27 add
## 4145 6.609605 6.405889 5.27 add
## 337 6.384216 6.140100 5.53 age
## 1789 6.617898 6.407540 5.53 age
## 3513 6.294657 6.142467 4.37 aid
## 4345 6.715092 6.426327 4.37 aid
Unless you specify descending order with desc()
english %>%
select(RTlexdec:Word) %>%
arrange(desc(Word)) %>%
head(10)
## RTlexdec RTnaming Familiarity Word
## 1144 6.263855 6.169611 3.30 zoo
## 2596 6.782079 6.555926 3.30 zoo
## 66 6.379037 6.145687 4.53 zone
## 1518 6.582441 6.486618 4.53 zone
## 398 6.390526 6.149109 4.43 zip
## 1850 6.609996 6.519147 4.43 zip
## 796 6.649075 6.222973 2.17 zing
## 2248 6.861847 6.641574 2.17 zing
## 787 6.537358 6.240276 3.73 zinc
## 2239 6.789636 6.528543 3.73 zinc
distinct()
Finds unique rows in a dataset; with no extra arguments it removes duplicate rows (if any exist).
english %>%
select(RTlexdec:Word) %>%
distinct(Word) %>%
head(10)
## Word
## 1 doe
## 2 whore
## 3 stress
## 4 pork
## 5 plug
## 6 prop
## 7 dawn
## 8 dog
## 9 arc
## 10 skirt
You can optionally specify columns to find unique combinations of values.
By default it returns only the columns you specify
english %>%
select(RTlexdec:Word) %>%
distinct(Word) %>%
head(10)
## Word
## 1 doe
## 2 whore
## 3 stress
## 4 pork
## 5 plug
## 6 prop
## 7 dawn
## 8 dog
## 9 arc
## 10 skirt
Unless you add .keep_all=TRUE argument
english %>%
select(RTlexdec:Word) %>%
distinct(Word, .keep_all=TRUE) %>%
head(10)
## RTlexdec RTnaming Familiarity Word
## 1 6.543754 6.145044 2.37 doe
## 2 6.397596 6.246882 4.43 whore
## 3 6.304942 6.143756 5.60 stress
## 4 6.424221 6.131878 3.87 pork
## 5 6.450597 6.198479 3.93 plug
## 6 6.531970 6.167726 3.27 prop
## 7 6.370586 6.123808 3.73 dawn
## 8 6.266859 6.096050 5.67 dog
## 9 6.608648 6.117657 3.10 arc
## 10 6.284843 6.179188 4.43 skirt
dplyr
mutate() - adds new columns calculated from existing columns
select() - selects columns based on their names
rename() - renames some columns
select()
Select columns based on their names.
english %>%
select(RTlexdec, Familiarity, AgeSubject) %>%
head(10)
## RTlexdec Familiarity AgeSubject
## 1 6.543754 2.37 young
## 2 6.397596 4.43 young
## 3 6.304942 5.60 young
## 4 6.424221 3.87 young
## 5 6.450597 3.93 young
## 6 6.531970 3.27 young
## 7 6.370586 3.73 young
## 8 6.266859 5.67 young
## 9 6.608648 3.10 young
## 10 6.284843 4.43 young
Use : to select everything from one column to
another
english %>%
select(RTlexdec:AgeSubject) %>%
head(10)
## RTlexdec RTnaming Familiarity Word AgeSubject
## 1 6.543754 6.145044 2.37 doe young
## 2 6.397596 6.246882 4.43 whore young
## 3 6.304942 6.143756 5.60 stress young
## 4 6.424221 6.131878 3.87 pork young
## 5 6.450597 6.198479 3.93 plug young
## 6 6.531970 6.167726 3.27 prop young
## 7 6.370586 6.123808 3.73 dawn young
## 8 6.266859 6.096050 5.67 dog young
## 9 6.608648 6.117657 3.10 arc young
## 10 6.284843 6.179188 4.43 skirt young
Use logical operators like & or ! to
identify the subset of columns you want to select
english %>%
select(!RTlexdec:AgeSubject) %>%
head(10)
## WordCategory WrittenFrequency WrittenSpokenFrequencyRatio FamilySize
## 1 N 3.912023 1.02165125 1.386294
## 2 N 4.521789 0.35048297 1.386294
## 3 N 6.505784 2.08935600 1.609438
## 4 N 5.017280 -0.52633389 1.945910
## 5 N 4.890349 -1.04454507 2.197225
## 6 N 4.770685 0.92480142 1.386294
## 7 N 6.383507 -0.01542102 1.098612
## 8 N 7.159292 -1.04821897 3.688879
## 9 N 4.890349 2.91626810 1.609438
## 10 N 5.929589 -0.07873252 1.791759
## DerivationalEntropy InflectionalEntropy NumberSimplexSynsets
## 1 0.14144 0.02114 0.6931472
## 2 0.42706 0.94198 1.0986123
## 3 0.06197 1.44339 2.4849066
## 4 0.43035 0.00000 1.0986123
## 5 0.35920 1.75393 2.4849066
## 6 0.06268 1.74730 1.6094379
## 7 0.00000 0.77898 1.7917595
## 8 1.03052 1.05129 2.0794415
## 9 0.15894 0.57715 1.6094379
## 10 0.06664 1.34797 1.9459101
## NumberComplexSynsets LengthInLetters Ncount MeanBigramFrequency
## 1 0.000000 3 8 7.036333
## 2 0.000000 5 5 9.537878
## 3 1.945910 6 0 9.883931
## 4 2.639057 4 8 8.309180
## 5 2.484907 4 3 7.943717
## 6 1.386294 4 9 8.349620
## 7 1.098612 4 6 7.699792
## 8 4.465908 3 13 7.139341
## 9 2.079442 3 3 7.747861
## 10 2.079442 5 3 8.561149
## FrequencyInitialDiphone ConspelV ConspelN ConphonV ConphonN ConfriendsV
## 1 12.02268 10 3.737670 41 8.837826 8
## 2 12.59780 20 7.870930 38 9.775825 20
## 3 13.30069 10 6.693324 13 7.040536 10
## 4 12.07807 5 6.677083 6 3.828641 4
## 5 11.92678 17 4.762174 17 4.762174 17
## 6 12.19724 19 6.234411 21 6.249975 19
## 7 11.32815 10 4.779123 13 8.864323 10
## 8 12.02268 13 4.795791 7 4.770685 6
## 9 13.29079 1 3.737670 11 6.091310 0
## 10 10.38890 7 4.624973 14 5.164786 7
## ConfriendsN ConffV ConffN ConfbV ConfbN NounFrequency VerbFrequency
## 1 3.295837 0.6931472 2.708050 3.496508 8.833900 49 0
## 2 7.870930 0.0000000 0.000000 2.944439 9.614738 142 0
## 3 6.693324 0.0000000 0.000000 1.386294 5.817111 565 473
## 4 3.526361 0.6931472 6.634633 1.098612 2.564949 150 0
## 5 4.762174 0.0000000 0.000000 0.000000 0.000000 170 120
## 6 6.234411 0.0000000 0.000000 1.098612 2.197225 125 280
## 7 4.779123 0.0000000 0.000000 1.386294 8.847504 582 110
## 8 3.761200 1.9459101 1.386294 0.000000 0.000000 2061 76
## 9 0.000000 0.0000000 0.000000 2.397895 5.993961 144 4
## 10 4.624973 0.0000000 0.000000 2.079442 4.304065 522 86
## CV Obstruent Frication Voice FrequencyInitialDiphoneWord
## 1 C obst burst voiced 10.129308
## 2 C obst frication voiceless 9.054388
## 3 C obst frication voiceless 12.422026
## 4 C obst burst voiceless 10.048151
## 5 C obst burst voiceless 11.796336
## 6 C obst burst voiceless 11.991567
## 7 C obst burst voiced 9.408125
## 8 C obst burst voiced 9.755336
## 9 V cont long voiced 6.424869
## 10 C obst frication voiceless 11.129011
## FrequencyInitialDiphoneSyllable CorrectLexdec
## 1 10.409763 27
## 2 9.148252 30
## 3 13.127395 30
## 4 11.003649 30
## 5 12.163092 26
## 6 12.436772 28
## 7 9.772410 30
## 8 10.255024 28
## 9 6.538140 25
## 10 11.578563 29
You can rename columns when you select them
english %>%
select(RTlex = RTlexdec, Familiarity, AgeSubject) %>%
head(10)
## RTlex Familiarity AgeSubject
## 1 6.543754 2.37 young
## 2 6.397596 4.43 young
## 3 6.304942 5.60 young
## 4 6.424221 3.87 young
## 5 6.450597 3.93 young
## 6 6.531970 3.27 young
## 7 6.370586 3.73 young
## 8 6.266859 5.67 young
## 9 6.608648 3.10 young
## 10 6.284843 4.43 young
rename()
Keep all columns but rename one or more.
english %>%
rename(RTlex = RTlexdec, WordCat = WordCategory) %>%
head(10)
## RTlex RTnaming Familiarity Word AgeSubject WordCat WrittenFrequency
## 1 6.543754 6.145044 2.37 doe young N 3.912023
## 2 6.397596 6.246882 4.43 whore young N 4.521789
## 3 6.304942 6.143756 5.60 stress young N 6.505784
## 4 6.424221 6.131878 3.87 pork young N 5.017280
## 5 6.450597 6.198479 3.93 plug young N 4.890349
## 6 6.531970 6.167726 3.27 prop young N 4.770685
## 7 6.370586 6.123808 3.73 dawn young N 6.383507
## 8 6.266859 6.096050 5.67 dog young N 7.159292
## 9 6.608648 6.117657 3.10 arc young N 4.890349
## 10 6.284843 6.179188 4.43 skirt young N 5.929589
## WrittenSpokenFrequencyRatio FamilySize DerivationalEntropy
## 1 1.02165125 1.386294 0.14144
## 2 0.35048297 1.386294 0.42706
## 3 2.08935600 1.609438 0.06197
## 4 -0.52633389 1.945910 0.43035
## 5 -1.04454507 2.197225 0.35920
## 6 0.92480142 1.386294 0.06268
## 7 -0.01542102 1.098612 0.00000
## 8 -1.04821897 3.688879 1.03052
## 9 2.91626810 1.609438 0.15894
## 10 -0.07873252 1.791759 0.06664
## InflectionalEntropy NumberSimplexSynsets NumberComplexSynsets
## 1 0.02114 0.6931472 0.000000
## 2 0.94198 1.0986123 0.000000
## 3 1.44339 2.4849066 1.945910
## 4 0.00000 1.0986123 2.639057
## 5 1.75393 2.4849066 2.484907
## 6 1.74730 1.6094379 1.386294
## 7 0.77898 1.7917595 1.098612
## 8 1.05129 2.0794415 4.465908
## 9 0.57715 1.6094379 2.079442
## 10 1.34797 1.9459101 2.079442
## LengthInLetters Ncount MeanBigramFrequency FrequencyInitialDiphone ConspelV
## 1 3 8 7.036333 12.02268 10
## 2 5 5 9.537878 12.59780 20
## 3 6 0 9.883931 13.30069 10
## 4 4 8 8.309180 12.07807 5
## 5 4 3 7.943717 11.92678 17
## 6 4 9 8.349620 12.19724 19
## 7 4 6 7.699792 11.32815 10
## 8 3 13 7.139341 12.02268 13
## 9 3 3 7.747861 13.29079 1
## 10 5 3 8.561149 10.38890 7
## ConspelN ConphonV ConphonN ConfriendsV ConfriendsN ConffV ConffN
## 1 3.737670 41 8.837826 8 3.295837 0.6931472 2.708050
## 2 7.870930 38 9.775825 20 7.870930 0.0000000 0.000000
## 3 6.693324 13 7.040536 10 6.693324 0.0000000 0.000000
## 4 6.677083 6 3.828641 4 3.526361 0.6931472 6.634633
## 5 4.762174 17 4.762174 17 4.762174 0.0000000 0.000000
## 6 6.234411 21 6.249975 19 6.234411 0.0000000 0.000000
## 7 4.779123 13 8.864323 10 4.779123 0.0000000 0.000000
## 8 4.795791 7 4.770685 6 3.761200 1.9459101 1.386294
## 9 3.737670 11 6.091310 0 0.000000 0.0000000 0.000000
## 10 4.624973 14 5.164786 7 4.624973 0.0000000 0.000000
## ConfbV ConfbN NounFrequency VerbFrequency CV Obstruent Frication
## 1 3.496508 8.833900 49 0 C obst burst
## 2 2.944439 9.614738 142 0 C obst frication
## 3 1.386294 5.817111 565 473 C obst frication
## 4 1.098612 2.564949 150 0 C obst burst
## 5 0.000000 0.000000 170 120 C obst burst
## 6 1.098612 2.197225 125 280 C obst burst
## 7 1.386294 8.847504 582 110 C obst burst
## 8 0.000000 0.000000 2061 76 C obst burst
## 9 2.397895 5.993961 144 4 V cont long
## 10 2.079442 4.304065 522 86 C obst frication
## Voice FrequencyInitialDiphoneWord FrequencyInitialDiphoneSyllable
## 1 voiced 10.129308 10.409763
## 2 voiceless 9.054388 9.148252
## 3 voiceless 12.422026 13.127395
## 4 voiceless 10.048151 11.003649
## 5 voiceless 11.796336 12.163092
## 6 voiceless 11.991567 12.436772
## 7 voiced 9.408125 9.772410
## 8 voiced 9.755336 10.255024
## 9 voiced 6.424869 6.538140
## 10 voiceless 11.129011 11.578563
## CorrectLexdec
## 1 27
## 2 30
## 3 30
## 4 30
## 5 26
## 6 28
## 7 30
## 8 28
## 9 25
## 10 29
mutate()
Add new columns that are calculated from existing columns.
english %>%
select(RTlexdec:AgeSubject) %>%
mutate(RTdiff = RTlexdec - RTnaming) %>%
head(10)
## RTlexdec RTnaming Familiarity Word AgeSubject RTdiff
## 1 6.543754 6.145044 2.37 doe young 0.3987099
## 2 6.397596 6.246882 4.43 whore young 0.1507144
## 3 6.304942 6.143756 5.60 stress young 0.1611859
## 4 6.424221 6.131878 3.87 pork young 0.2923421
## 5 6.450597 6.198479 3.93 plug young 0.2521181
## 6 6.531970 6.167726 3.27 prop young 0.3642442
## 7 6.370586 6.123808 3.73 dawn young 0.2467779
## 8 6.266859 6.096050 5.67 dog young 0.1708092
## 9 6.608648 6.117657 3.10 arc young 0.4909916
## 10 6.284843 6.179188 4.43 skirt young 0.1056547
Columns are added to the right by default, but you can specify where you’d like to add them by number or name
With .after
english %>%
select(RTlexdec:Familiarity) %>%
mutate(
RTdiff = RTlexdec - RTnaming,
.after = RTnaming
) %>%
head(10)
## RTlexdec RTnaming RTdiff Familiarity
## 1 6.543754 6.145044 0.3987099 2.37
## 2 6.397596 6.246882 0.1507144 4.43
## 3 6.304942 6.143756 0.1611859 5.60
## 4 6.424221 6.131878 0.2923421 3.87
## 5 6.450597 6.198479 0.2521181 3.93
## 6 6.531970 6.167726 0.3642442 3.27
## 7 6.370586 6.123808 0.2467779 3.73
## 8 6.266859 6.096050 0.1708092 5.67
## 9 6.608648 6.117657 0.4909916 3.10
## 10 6.284843 6.179188 0.1056547 4.43
With .before
english %>%
select(RTlexdec:Familiarity) %>%
mutate(
RTdiff = RTlexdec - RTnaming,
.before = RTlexdec
) %>%
head(10)
## RTdiff RTlexdec RTnaming Familiarity
## 1 0.3987099 6.543754 6.145044 2.37
## 2 0.1507144 6.397596 6.246882 4.43
## 3 0.1611859 6.304942 6.143756 5.60
## 4 0.2923421 6.424221 6.131878 3.87
## 5 0.2521181 6.450597 6.198479 3.93
## 6 0.3642442 6.531970 6.167726 3.27
## 7 0.2467779 6.370586 6.123808 3.73
## 8 0.1708092 6.266859 6.096050 5.67
## 9 0.4909916 6.608648 6.117657 3.10
## 10 0.1056547 6.284843 6.179188 4.43
dplyr
group_by() - used to divide your dataset into groups
summarise() - often used after group_by() to calculate summary statistics on grouped data
ungroup() - used to remove the grouping
group_by()
Divide your dataset into groups.
english %>%
select(RTlexdec, Familiarity, AgeSubject) %>%
group_by(AgeSubject)
## # A tibble: 4,568 × 3
## # Groups: AgeSubject [2]
## RTlexdec Familiarity AgeSubject
## <dbl> <dbl> <fct>
## 1 6.54 2.37 young
## 2 6.40 4.43 young
## 3 6.30 5.6 young
## 4 6.42 3.87 young
## 5 6.45 3.93 young
## 6 6.53 3.27 young
## 7 6.37 3.73 young
## 8 6.27 5.67 young
## 9 6.61 3.1 young
## 10 6.28 4.43 young
## # ℹ 4,558 more rows
Can group by more than one variable
english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
group_by(AgeSubject, Voice)
## # A tibble: 4,568 × 4
## # Groups: AgeSubject, Voice [4]
## RTlexdec Familiarity AgeSubject Voice
## <dbl> <dbl> <fct> <fct>
## 1 6.54 2.37 young voiced
## 2 6.40 4.43 young voiceless
## 3 6.30 5.6 young voiceless
## 4 6.42 3.87 young voiceless
## 5 6.45 3.93 young voiceless
## 6 6.53 3.27 young voiceless
## 7 6.37 3.73 young voiced
## 8 6.27 5.67 young voiced
## 9 6.61 3.1 young voiced
## 10 6.28 4.43 young voiceless
## # ℹ 4,558 more rows
group_by() does not change the original data frame; it adds a groups attribute
grouped_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject) %>%
group_by(AgeSubject)
grouped_english
## # A tibble: 4,568 × 3
## # Groups: AgeSubject [2]
## RTlexdec Familiarity AgeSubject
## <dbl> <dbl> <fct>
## 1 6.54 2.37 young
## 2 6.40 4.43 young
## 3 6.30 5.6 young
## 4 6.42 3.87 young
## 5 6.45 3.93 young
## 6 6.53 3.27 young
## 7 6.37 3.73 young
## 8 6.27 5.67 young
## 9 6.61 3.1 young
## 10 6.28 4.43 young
## # ℹ 4,558 more rows
attr(english, "groups")
## NULL
attr(grouped_english, "groups")
## # A tibble: 2 × 2
## AgeSubject .rows
## <fct> <list<int>>
## 1 old [2,284]
## 2 young [2,284]
summarise()
Often used after group_by() to calculate summary stats on grouped data.
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject) %>%
group_by(AgeSubject) %>%
summarise(n = n())
summary_english
## # A tibble: 2 × 2
## AgeSubject n
## <fct> <int>
## 1 old 2284
## 2 young 2284
You can add any number of summary stats
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
group_by(AgeSubject, Voice) %>%
summarise(
n = n(),
mean = mean(RTlexdec)
)
## `summarise()` has grouped output by 'AgeSubject'. You can override using the
## `.groups` argument.
summary_english
## # A tibble: 4 × 4
## # Groups: AgeSubject [2]
## AgeSubject Voice n mean
## <fct> <fct> <int> <dbl>
## 1 old voiced 1030 6.67
## 2 old voiceless 1254 6.66
## 3 young voiced 1030 6.44
## 4 young voiceless 1254 6.43
Use the .groups argument to drop or keep grouping in
returned dataframe
Drop all groups
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
group_by(AgeSubject, Voice) %>%
summarise(
n = n(),
mean = mean(RTlexdec),
.groups = "drop"
)
summary_english
## # A tibble: 4 × 4
## AgeSubject Voice n mean
## <fct> <fct> <int> <dbl>
## 1 old voiced 1030 6.67
## 2 old voiceless 1254 6.66
## 3 young voiced 1030 6.44
## 4 young voiceless 1254 6.43
Keep all groups
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
group_by(AgeSubject, Voice) %>%
summarise(
n = n(),
mean = mean(RTlexdec),
.groups = "keep"
)
summary_english
## # A tibble: 4 × 4
## # Groups: AgeSubject, Voice [4]
## AgeSubject Voice n mean
## <fct> <fct> <int> <dbl>
## 1 old voiced 1030 6.67
## 2 old voiceless 1254 6.66
## 3 young voiced 1030 6.44
## 4 young voiceless 1254 6.43
Or use the new .by argument instead of
group_by() to return an ungrouped dataframe
Drop all groups
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
group_by(AgeSubject, Voice) %>%
summarise(
n = n(),
mean = mean(RTlexdec),
.groups = "drop"
)
summary_english
## # A tibble: 4 × 4
## AgeSubject Voice n mean
## <fct> <fct> <int> <dbl>
## 1 old voiced 1030 6.67
## 2 old voiceless 1254 6.66
## 3 young voiced 1030 6.44
## 4 young voiceless 1254 6.43
Or just use .by!
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
summarise(
n = n(),
mean = mean(RTlexdec),
.by = c(AgeSubject, Voice)
)
summary_english
## AgeSubject Voice n mean
## 1 young voiced 1030 6.444450
## 2 young voiceless 1254 6.434955
## 3 old voiced 1030 6.666049
## 4 old voiceless 1254 6.656777
ungroup()
You can also remove grouping afterwards with ungroup().
Drop all groups
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
group_by(AgeSubject, Voice) %>%
summarise(
n = n(),
mean = mean(RTlexdec),
.groups = "drop"
)
summary_english
## # A tibble: 4 × 4
## AgeSubject Voice n mean
## <fct> <fct> <int> <dbl>
## 1 old voiced 1030 6.67
## 2 old voiceless 1254 6.66
## 3 young voiced 1030 6.44
## 4 young voiceless 1254 6.43
Or ungroup after
summary_english <- english %>%
select(RTlexdec, Familiarity, AgeSubject, Voice) %>%
group_by(AgeSubject, Voice) %>%
summarise(
n = n(),
mean = mean(RTlexdec)
) %>%
ungroup()
summary_english
## # A tibble: 4 × 4
## AgeSubject Voice n mean
## <fct> <fct> <int> <dbl>
## 1 old voiced 1030 6.67
## 2 old voiceless 1254 6.66
## 3 young voiced 1030 6.44
## 4 young voiceless 1254 6.43