The Rmarkdown for this class is on GitHub.
The tidyverse is a collection of packages that share a similar design philosophy, syntax, and data structures. The packages are largely developed by the same team that builds RStudio.
Some key packages that we will touch on in this course:
readr
: functions for data import and export
ggplot2
: plotting based on the “grammar of graphics”
dplyr
: functions to manipulate tabular data
tidyr
: functions to help reshape data into a tidy format
stringr
: functions for working with strings
tibble
: a redesigned data.frame
To use an R package in an analysis we need to load the package using the library() function. This needs to be done once in each R session, and it is a good idea to do this at the beginning of your Rmarkdown. For teaching purposes, however, I will sometimes load a package when I first introduce a function from it.
A tibble is a re-imagining of the base R data.frame. It has a few differences from the data.frame. The biggest differences are that it doesn’t have row.names and it has an enhanced print method. If interested in learning more, see the tibble vignette.
Compare data_df to data_tbl.
library(tibble) # provides as_tibble()
data_df <- data.frame(a = 1:3,
                      b = letters[1:3],
                      c = c(TRUE, FALSE, TRUE),
                      row.names = c("ob_1", "ob_2", "ob_3"))
data_df
data_tbl <- as_tibble(data_df) # row names are dropped
data_tbl
When you work with tidyverse functions it is good practice to convert data.frames to tibbles. In practice many functions will work interchangeably with either base data.frames or tibbles, provided that they don’t use row names.
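As a small illustration of one place where the two types do behave differently, the sketch below compares single-column subsetting with the bracket. The objects df and tbl are made up for this example and are not part of the class data.

```r
library(tibble)

# Hypothetical toy objects, used only for illustration
df  <- data.frame(a = 1:3, b = letters[1:3])
tbl <- as_tibble(df)

# Selecting one column with [ simplifies a data.frame to a plain vector ...
df_col <- df[, "a"]
class(df_col)

# ... while the same operation on a tibble returns a one-column tibble
tbl_col <- tbl[, "a"]
class(tbl_col)

# Use [[ or $ when you want the underlying vector from a tibble
tbl[["a"]]
```

Code that relies on the data.frame behavior is one case where swapping in a tibble can change results.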
If a data.frame has row names, you can preserve these by moving them into a column before converting to a tibble using the rownames_to_column() function from tibble.
head(mtcars)
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars_tbl <- rownames_to_column(mtcars, "vehicle")
mtcars_tbl <- as_tibble(mtcars_tbl)
mtcars_tbl
# A tibble: 32 × 12
vehicle mpg cyl disp hp drat wt qsec vs am gear carb
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Mazda RX4 21 6 160 110 3.9 2.62 16.5 0 1 4 4
2 Mazda RX4 … 21 6 160 110 3.9 2.88 17.0 0 1 4 4
3 Datsun 710 22.8 4 108 93 3.85 2.32 18.6 1 1 4 1
4 Hornet 4 D… 21.4 6 258 110 3.08 3.22 19.4 1 0 3 1
5 Hornet Spo… 18.7 8 360 175 3.15 3.44 17.0 0 0 3 2
6 Valiant 18.1 6 225 105 2.76 3.46 20.2 1 0 3 1
7 Duster 360 14.3 8 360 245 3.21 3.57 15.8 0 0 3 4
8 Merc 240D 24.4 4 147. 62 3.69 3.19 20 1 0 4 2
9 Merc 230 22.8 4 141. 95 3.92 3.15 22.9 1 0 4 2
10 Merc 280 19.2 6 168. 123 3.92 3.44 18.3 1 0 4 4
# ℹ 22 more rows
If you don’t need the rownames, then you can use the as_tibble() function directly.
mtcars_tbl <- as_tibble(mtcars)
So far we have only worked with built-in or hand-generated datasets. Now we will discuss how to read data files into R.
The readr package provides a series of functions for importing or writing data in common text formats.
read_csv()
: comma-separated values (CSV) files
read_tsv()
: tab-separated values (TSV) files
read_delim()
: delimited files (CSV and TSV are important special cases)
read_fwf()
: fixed-width files
read_table()
: whitespace-separated files
These functions are quicker and have better defaults than the base R equivalents (e.g. read.table or read.csv). These functions also directly output tibbles rather than base R data.frames.
The readr cheatsheet provides a concise overview of the functionality in the package.
To illustrate how to use readr we will load a .csv file containing information about airline flights from 2014.
First we will download the data files. You can download this data manually from GitHub. However, we will use R to download the dataset using the base R function download.file().
# test if file exists, if it doesn't then download the file.
if(!file.exists("flights14.csv")) {
  file_url <- "https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv"
  download.file(file_url, "flights14.csv")
}
You should now have a file called “flights14.csv” in your working directory (the same directory as the Rmarkdown). To read this data into R, we can use the read_csv() function. The defaults for this function work for many datasets.
flights <- read_csv("flights14.csv")
flights
# A tibble: 253,316 × 11
year month day dep_delay arr_delay carrier origin dest air_time distance
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2014 1 1 14 13 AA JFK LAX 359 2475
2 2014 1 1 -3 13 AA JFK LAX 363 2475
3 2014 1 1 2 9 AA JFK LAX 351 2475
4 2014 1 1 -8 -26 AA LGA PBI 157 1035
5 2014 1 1 2 1 AA JFK LAX 350 2475
6 2014 1 1 4 0 AA EWR LAX 339 2454
7 2014 1 1 -2 -18 AA JFK LAX 338 2475
8 2014 1 1 -3 -14 AA JFK LAX 356 2475
9 2014 1 1 -1 -17 AA JFK MIA 161 1089
10 2014 1 1 -2 -14 AA JFK SEA 349 2422
# ℹ 253,306 more rows
# ℹ 1 more variable: hour <dbl>
There are a few commonly used arguments:
col_names
: if the data doesn’t have column names, you can provide them (or skip them).
col_types
: set this if the data type of a column is incorrectly inferred by readr
comment
: if the file contains comment lines that you want to skip, such as a header line prefixed with #, set this to "#".
skip
: number of lines to skip before reading in the data.
n_max
: maximum number of lines to read, useful for testing reading in large datasets.
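To make these arguments concrete, here is a minimal sketch that writes a tiny made-up CSV to a temporary file and reads it back in three ways. The file contents and the object names (csv_path, with_comment, first_two, no_header) are assumptions for illustration only.

```r
library(readr)

# Write a small, hypothetical CSV with a leading comment line
csv_path <- tempfile(fileext = ".csv")
writeLines(c("# exported from a hypothetical system",
             "id,delay,carrier",
             "1,14,AA",
             "2,-3,AA",
             "3,2,B6"),
           csv_path)

# comment: ignore lines starting with "#"; col_types: set column types explicitly
with_comment <- read_csv(csv_path,
                         comment = "#",
                         col_types = cols(id = col_integer(),
                                          delay = col_double(),
                                          carrier = col_character()))

# skip: jump over the first line; n_max: read at most 2 rows of data
first_two <- read_csv(csv_path, skip = 1, n_max = 2, show_col_types = FALSE)

# col_names: skip the comment and header lines, then supply our own names
no_header <- read_csv(csv_path,
                      skip = 2,
                      col_names = c("flight_id", "dep_delay", "airline"),
                      show_col_types = FALSE)
```

Setting col_types explicitly also documents what you expect the file to contain, which can help catch malformed input early.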
The readr functions will also automatically uncompress gzipped or zipped datasets, and additionally can read data directly from a URL.
read_csv("https://raw.githubusercontent.com/Rdatatable/data.table/master/vignettes/flights14.csv")
There are equivalent functions for writing data.frames from R to files: write_csv, write_tsv, write_delim.
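A minimal round-trip sketch of the writing functions, using a made-up data.frame (scores) and temporary files so that nothing is written to your working directory:

```r
library(readr)

# Hypothetical example data
scores <- data.frame(id = 1:3, name = c("a", "b", "c"))

# Write as comma-separated and tab-separated text
csv_path <- tempfile(fileext = ".csv")
tsv_path <- tempfile(fileext = ".tsv")
write_csv(scores, csv_path)
write_tsv(scores, tsv_path)

# Read the CSV back in; the result is a tibble
scores_back <- read_csv(csv_path, show_col_types = FALSE)
scores_back

# Inspect the raw header line of the TSV to confirm the delimiter
tsv_header <- readLines(tsv_path)[1]
```

Note that the column types are inferred again on reading, so a text round trip does not necessarily preserve them exactly.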
The readxl package can read data from Excel files. It is installed with the tidyverse but is not attached by library(tidyverse), so load it with library(readxl). The read_excel() function is the main function for reading data.
The openxlsx package, which is not part of the tidyverse but is on CRAN, can write Excel files. The write.xlsx() function is the main function for writing data to Excel spreadsheets.
Often it is useful to store R objects as files on disk so that the R objects can be reloaded into R. These could be large processed datasets, intermediate results, or complex data structures that are not easily stored in rectangular text formats such as csv files.
R provides the saveRDS() and readRDS() functions for storing and retrieving data in binary formats.
saveRDS(flights, "flights.rds") # save single object into a file
df <- readRDS("flights.rds") # read object back into R
df
# A tibble: 253,316 × 11
year month day dep_delay arr_delay carrier origin dest air_time distance
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2014 1 1 14 13 AA JFK LAX 359 2475
2 2014 1 1 -3 13 AA JFK LAX 363 2475
3 2014 1 1 2 9 AA JFK LAX 351 2475
4 2014 1 1 -8 -26 AA LGA PBI 157 1035
5 2014 1 1 2 1 AA JFK LAX 350 2475
6 2014 1 1 4 0 AA EWR LAX 339 2454
7 2014 1 1 -2 -18 AA JFK LAX 338 2475
8 2014 1 1 -3 -14 AA JFK LAX 356 2475
9 2014 1 1 -1 -17 AA JFK MIA 161 1089
10 2014 1 1 -2 -14 AA JFK SEA 349 2422
# ℹ 253,306 more rows
# ℹ 1 more variable: hour <dbl>
If you want to save/load multiple objects you can use save() and load().
save(flights, df, file = "robjs.rda") # save flights and df
load() will load the data into the environment with the same object names used when saving the objects.
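The following sketch shows the full save() and load() round trip on two made-up objects (x and y), using a temporary file. It also shows that load() restores objects under their original names instead of returning them.

```r
# Hypothetical objects to store
x <- 1:3
y <- data.frame(id = 1:2)

# Save both objects into one file
rda_path <- tempfile(fileext = ".rda")
save(x, y, file = rda_path)

# Remove them from the environment, then restore them from disk
rm(x, y)
loaded_names <- load(rda_path) # load() invisibly returns the restored names
loaded_names

# The objects are back under the names they were saved with
x
y
```

Because load() can silently overwrite existing objects with the same names, saveRDS() and readRDS() are often the safer choice for a single object.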
View() can be used to open an Excel-like view of a data.frame. This is a good way to quickly look at the data. glimpse() or str() give an additional view of the data.
View(flights)
str(flights)
glimpse(flights)
Additional R functions to help with exploring data.frames (and tibbles):
Useful base R functions for exploring values
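A few base R functions commonly used for this kind of exploration are sketched below on a made-up data.frame (toy_df). The selection is illustrative, not exhaustive.

```r
# Hypothetical example data with one missing value
toy_df <- data.frame(carrier = c("AA", "AA", "B6", "UA"),
                     dep_delay = c(14, -3, 2, NA))

# Exploring the data.frame as a whole
dim(toy_df)      # number of rows and columns
nrow(toy_df)
ncol(toy_df)
colnames(toy_df)
head(toy_df, 2)  # first two rows
summary(toy_df)  # per-column summaries

# Exploring the values in a single column
unique(toy_df$carrier)                 # distinct values
carrier_counts <- table(toy_df$carrier) # how often each value occurs
carrier_counts
n_missing <- sum(is.na(toy_df$dep_delay)) # count missing values
range(toy_df$dep_delay, na.rm = TRUE)   # smallest and largest value
```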
In the first two lectures we introduced how to subset vectors, data.frames, and matrices using base R functions. These approaches are flexible, succinct, and stable, meaning that these approaches will be supported and work in R in the future.
Some criticisms of using base R are that the syntax is hard to read, it tends to be verbose, and it is difficult to learn. dplyr, and other tidyverse packages, offer alternative approaches which many find easier to use.
Some key differences between base R and the approaches in dplyr (and tidyverse):
- Use of the tibble version of data.frame
- dplyr functions operate on data.frames/tibbles rather than individual vectors
- dplyr allows you to specify column names without quotes
- dplyr uses different functions (verbs) to accomplish the various tasks performed by the bracket [ base R syntax
- dplyr and related functions recognize “grouped” operations on data.frames, enabling operations on different groups of rows in a data.frame
dplyr provides a suite of functions for manipulating data in tibbles.
Operations on Rows:
- filter() chooses rows based on column values
- arrange() changes the order of the rows
- distinct() selects distinct/unique rows
- slice() chooses rows based on location
Operations on Columns:
- select() changes whether or not a column is included
- rename() changes the name of columns
- mutate() changes the values of columns and creates new columns
Operations on groups of rows:
- summarise() collapses a group into a single row
Returning to our flights data. Let’s use filter() to select certain rows.
filter(tibble, <expression that produces a logical vector>, ...)
filter(flights, dest == "LAX") # select rows where the `dest` column is equal to `LAX`
# A tibble: 14,434 × 11
year month day dep_delay arr_delay carrier origin dest air_time distance
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2014 1 1 14 13 AA JFK LAX 359 2475
2 2014 1 1 -3 13 AA JFK LAX 363 2475
3 2014 1 1 2 9 AA JFK LAX 351 2475
4 2014 1 1 2 1 AA JFK LAX 350 2475
5 2014 1 1 4 0 AA EWR LAX 339 2454
6 2014 1 1 -2 -18 AA JFK LAX 338 2475
7 2014 1 1 -3 -14 AA JFK LAX 356 2475
8 2014 1 1 142 133 AA JFK LAX 345 2475
9 2014 1 1 -4 11 B6 JFK LAX 349 2475
10 2014 1 1 3 -10 B6 JFK LAX 349 2475
# ℹ 14,424 more rows
# ℹ 1 more variable: hour <dbl>
Multiple conditions can be used to select rows. For example we can select rows where the dest column is equal to LAX and the origin is equal to EWR. You can either use the & operator, or supply multiple arguments.
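Both forms can be sketched on a small made-up tibble (toy_flights, which only mimics a few columns of flights):

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- tibble(
  carrier = c("AA", "AA", "B6", "UA", "DL"),
  origin  = c("JFK", "EWR", "JFK", "EWR", "LGA"),
  dest    = c("LAX", "LAX", "SEA", "SFO", "LAX")
)

# Combine conditions with the & operator ...
both_and <- filter(toy_flights, dest == "LAX" & origin == "EWR")

# ... or pass each condition as a separate argument
both_comma <- filter(toy_flights, dest == "LAX", origin == "EWR")

both_and
both_comma
```

The two calls keep the same rows, so which form to use is mostly a matter of readability.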
We can select rows where the dest column is equal to LAX or the origin is equal to EWR using the | operator.
filter(flights, dest == "LAX" | origin == "EWR")
The %in% operator is useful for identifying rows with entries matching those in a vector of possibilities.
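A minimal sketch of %in% inside filter(), again on a made-up tibble. The vector west_coast is an assumption chosen for this example.

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- tibble(
  carrier = c("AA", "AA", "B6", "UA", "DL"),
  dest    = c("LAX", "PBI", "SEA", "SFO", "MIA")
)

# Vector of possibilities to match against
west_coast <- c("LAX", "SFO", "SEA")

# Keep rows whose dest is any of the listed airports
to_west <- filter(toy_flights, dest %in% west_coast)

# Negate with ! to keep everything else
not_west <- filter(toy_flights, !dest %in% west_coast)

to_west
not_west
```

This is usually easier to read than chaining several == comparisons with the | operator.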
arrange() can be used to sort the data based on values in a single column or multiple columns.
arrange(tibble, <columns_to_sort_by>)
For example, let’s find the flight with the shortest amount of air time by arranging the table based on the air_time column (flight time in minutes).
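A sketch of this on a made-up tibble (toy_flights), also showing descending order with desc() and sorting by more than one column:

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- tibble(
  origin   = c("JFK", "EWR", "JFK", "EWR", "LGA"),
  dest     = c("LAX", "LAX", "SEA", "SFO", "PBI"),
  air_time = c(359, 339, 349, 341, 157)
)

# Shortest air time first (ascending order is the default)
shortest_first <- arrange(toy_flights, air_time)

# Longest air time first, using desc()
longest_first <- arrange(toy_flights, desc(air_time))

# Sort by origin, then by air time within each origin
by_origin <- arrange(toy_flights, origin, air_time)

shortest_first
```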
select() is a simple function that subsets the tibble to keep certain columns.
select(tibble, <columns_to_keep>)
select(flights, origin, dest)
# A tibble: 253,316 × 2
origin dest
<chr> <chr>
1 JFK LAX
2 JFK LAX
3 JFK LAX
4 LGA PBI
5 JFK LAX
6 EWR LAX
7 JFK LAX
8 JFK LAX
9 JFK MIA
10 JFK SEA
# ℹ 253,306 more rows
The : operator can select a range of columns, such as the columns from air_time to hour. The ! operator selects columns not listed.
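Both operators can be sketched on a made-up tibble whose column order mimics part of flights (toy_flights is an assumption for illustration):

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- tibble(
  year = 2014, month = 1, day = 1:2,
  origin = c("JFK", "LGA"), dest = c("LAX", "PBI"),
  air_time = c(359, 157), distance = c(2475, 1035), hour = c(9, 11)
)

# A range of adjacent columns, from air_time to hour
time_cols <- select(toy_flights, air_time:hour)

# Everything except the listed columns
no_date <- select(toy_flights, !c(year, month, day))

# The two can be combined: everything except a range
no_time <- select(toy_flights, !(air_time:hour))

names(time_cols)
names(no_date)
names(no_time)
```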
The tidyverse provides a suite of helpers for selecting columns by name: matches(), starts_with(), ends_with(), contains(), any_of(), and all_of(). everything() is also useful as a placeholder for all columns not explicitly listed. See the help page ?select.
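A short sketch of these helpers on a made-up tibble (toy_flights, which only mimics a few columns of flights). Note that the pattern passed to a helper is a quoted string.

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- tibble(
  dep_delay = c(14, -3),
  arr_delay = c(13, 5),
  carrier   = c("AA", "B6"),
  origin    = c("JFK", "EWR"),
  dest      = c("LAX", "SEA")
)

# Columns whose names end with, or contain, the string "delay"
delay_cols  <- select(toy_flights, ends_with("delay"))
delay_cols2 <- select(toy_flights, contains("delay"))

# Move carrier to the front and keep all other columns after it
carrier_first <- select(toy_flights, carrier, everything())

# any_of() takes a character vector and ignores names that don't exist
wanted <- c("origin", "dest", "not_a_column")
route_cols <- select(toy_flights, any_of(wanted))

names(delay_cols)
names(carrier_first)
names(route_cols)
```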
In general, when working with the tidyverse, you don’t need to quote the names of columns. In the example above, we needed quotes because “delay” is not a column name in the flights tibble.
mutate() allows you to add new columns to the tibble.
mutate(tibble, new_column_name = expression, ...)
mutate(flights, total_delay = dep_delay + arr_delay)
# A tibble: 253,316 × 12
year month day dep_delay arr_delay carrier origin dest air_time distance
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2014 1 1 14 13 AA JFK LAX 359 2475
2 2014 1 1 -3 13 AA JFK LAX 363 2475
3 2014 1 1 2 9 AA JFK LAX 351 2475
4 2014 1 1 -8 -26 AA LGA PBI 157 1035
5 2014 1 1 2 1 AA JFK LAX 350 2475
6 2014 1 1 4 0 AA EWR LAX 339 2454
7 2014 1 1 -2 -18 AA JFK LAX 338 2475
8 2014 1 1 -3 -14 AA JFK LAX 356 2475
9 2014 1 1 -1 -17 AA JFK MIA 161 1089
10 2014 1 1 -2 -14 AA JFK SEA 349 2422
# ℹ 253,306 more rows
# ℹ 2 more variables: hour <dbl>, total_delay <dbl>
We can’t see the new column, so we add a select command to examine the columns of interest.
mutate(flights, total_delay = dep_delay + arr_delay) |>
  select(dep_delay, arr_delay, total_delay)
# A tibble: 253,316 × 3
dep_delay arr_delay total_delay
<dbl> <dbl> <dbl>
1 14 13 27
2 -3 13 10
3 2 9 11
4 -8 -26 -34
5 2 1 3
6 4 0 4
7 -2 -18 -20
8 -3 -14 -17
9 -1 -17 -18
10 -2 -14 -16
# ℹ 253,306 more rows
Multiple new columns can be made, and you can refer to columns made in preceding statements.
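For instance, in the sketch below the second new column is computed from the first within the same mutate() call. The tibble toy_flights and the column name total_delay_hours are made up for this example.

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- tibble(
  dep_delay = c(14, -3, 2),
  arr_delay = c(13, 13, 9)
)

with_delays <- mutate(toy_flights,
                      total_delay = dep_delay + arr_delay,
                      # total_delay was created on the line above
                      total_delay_hours = total_delay / 60)

with_delays
```

The expressions are evaluated in order, so a column can only refer to columns defined before it.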
Try it out:
Calculate the flight time (air_time) in hours rather than in minutes, and add it as a new column.
mutate(flights, flight_time = air_time / 60)
# A tibble: 253,316 × 12
year month day dep_delay arr_delay carrier origin dest air_time distance
<dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl>
1 2014 1 1 14 13 AA JFK LAX 359 2475
2 2014 1 1 -3 13 AA JFK LAX 363 2475
3 2014 1 1 2 9 AA JFK LAX 351 2475
4 2014 1 1 -8 -26 AA LGA PBI 157 1035
5 2014 1 1 2 1 AA JFK LAX 350 2475
6 2014 1 1 4 0 AA EWR LAX 339 2454
7 2014 1 1 -2 -18 AA JFK LAX 338 2475
8 2014 1 1 -3 -14 AA JFK LAX 356 2475
9 2014 1 1 -1 -17 AA JFK MIA 161 1089
10 2014 1 1 -2 -14 AA JFK SEA 349 2422
# ℹ 253,306 more rows
# ℹ 2 more variables: hour <dbl>, flight_time <dbl>
summarize() is a function that will collapse the data from a column into a summary value based on a function that takes a vector and returns a single value (e.g. mean(), sum(), median()). It is not very useful yet, but will be very powerful when we discuss grouped operations.
summarize(flights,
          avg_arr_delay = mean(arr_delay),
          med_air_time = median(air_time))
# A tibble: 1 × 2
avg_arr_delay med_air_time
<dbl> <dbl>
1 8.15 134
All of the functionality described above can be easily expressed in base R syntax (see examples here). However, where dplyr really shines is the ability to apply the functions above to groups of data within each data frame.
We can establish groups within the data using group_by(). The functions mutate(), summarize(), and optionally arrange() will instead operate on each group independently rather than all of the rows.
Common approaches:
- group_by -> summarize: calculate summaries per group
- group_by -> mutate: calculate summaries per group and add as a new column to the original tibble
group_by(tibble, <columns_to_establish_groups>)
group_by(flights, carrier) # notice the new "Groups:" metadata.
# calculate average dep_delay per carrier
group_by(flights, carrier) |>
  summarize(avg_dep_delay = mean(dep_delay))
# calculate average dep_delay per carrier at each origin airport
group_by(flights, carrier, origin) |>
  summarize(avg_dep_delay = mean(dep_delay))
# calculate # of flights between each origin and destination city, per carrier, and average air time.
# n() is a special function that returns the # of rows per group
group_by(flights, carrier, origin, dest) |>
  summarize(n_flights = n(),
            mean_air_time = mean(air_time))
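The examples above all follow the group_by -> summarize pattern. The group_by -> mutate pattern can be sketched as follows on a made-up tibble; toy_flights and the new column names are assumptions for illustration.

```r
library(dplyr)

# Hypothetical stand-in for the flights data
toy_flights <- tibble(
  carrier   = c("AA", "AA", "B6", "UA", "UA"),
  dep_delay = c(14, -3, 2, 40, -8)
)

# summarize() collapses each group to a single row
per_carrier <- toy_flights |>
  group_by(carrier) |>
  summarize(avg_dep_delay = mean(dep_delay))

# mutate() keeps every row and adds the per-group value alongside it
with_carrier_avg <- toy_flights |>
  group_by(carrier) |>
  mutate(carrier_avg_delay = mean(dep_delay),
         delay_vs_avg = dep_delay - carrier_avg_delay) |>
  ungroup() # drop the grouping so later steps act on all rows

per_carrier
with_carrier_avg
```

Calling ungroup() at the end is a defensive habit: it avoids later operations being applied per group by accident.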
Here are some questions that we can answer using grouped operations in a few lines of dplyr code.
What is the average air_time between each origin airport and destination airport?
group_by(flights, origin, dest) |>
  summarize(avg_air_time = mean(air_time))
# A tibble: 221 × 3
# Groups: origin [3]
origin dest avg_air_time
<chr> <chr> <dbl>
1 EWR ALB 31.4
2 EWR ANC 424.
3 EWR ATL 111.
4 EWR AUS 210.
5 EWR AVL 89.7
6 EWR AVP 25
7 EWR BDL 25.4
8 EWR BNA 115.
9 EWR BOS 40.1
10 EWR BQN 197.
# ℹ 211 more rows
Which origin and destination airports take the longest (air_time) to fly between on average? The shortest?
group_by(flights, origin, dest) |>
summarize(avg_air_time = mean(air_time)) |>
arrange(desc(avg_air_time)) |>
head(1)
# A tibble: 1 × 3
# Groups: origin [1]
origin dest avg_air_time
<chr> <chr> <dbl>
1 JFK HNL 625.
group_by(flights, origin, dest) |>
summarize(avg_air_time = mean(air_time)) |>
arrange(avg_air_time) |>
head(1)
# A tibble: 1 × 3
# Groups: origin [1]
origin dest avg_air_time
<chr> <chr> <dbl>
1 EWR AVP 25
Try it out:
… (air_time) on average from JFK to LAX?
stringr is a package for working with strings (i.e. character vectors). It provides a consistent syntax for string manipulation and can perform many routine tasks:
str_c
: concatenate strings (similar to paste() in base R)
str_count
: count occurrence of a substring in a string
str_subset
: keep strings with a substring
str_replace
: replace a string with another string
str_split
: split a string into multiple pieces based on a string
library(stringr)
some_words <- c("a sentence", "with a ", "needle in a", "haystack")
str_detect(some_words, "needle") # use with dplyr::filter
str_subset(some_words, "needle")
str_replace(some_words, "needle", "pumpkin")
str_replace_all(some_words, "a", "A")
str_c(some_words, collapse = " ")
str_c(some_words, " words words words", " anisfhlsdihg")
str_count(some_words, "a")
str_split(some_words, " ")
stringr uses regular expressions to pattern match strings. This means that you can perform complex matching to the strings of interest. Additionally this means that there are special characters with behaviors that may be surprising if you are unaware of regular expressions.
A useful resource when using regular expressions is https://regex101.com
complex_strings <- c("10101-howdy", "34-world", "howdy-1010", "world-.")
# keep words with a series of #s followed by a dash, + indicates one or more occurrences.
str_subset(complex_strings, "[0-9]+-")
# keep words with a dash followed by a series of #s
str_subset(complex_strings, "-[0-9]+")
str_subset(complex_strings, "^howdy") # keep words starting with howdy
str_subset(complex_strings, "howdy$") # keep words ending with howdy
str_subset(complex_strings, ".") # . signifies any character
str_subset(complex_strings, "\\.") # use backslashes to escape a special character and match it literally
Let’s use dplyr and stringr together.
Which destinations contain an “LL” in their 3 letter code?
filter(flights, str_detect(dest, "LL")) |>
  select(dest) |>
  unique()
# A tibble: 1 × 1
dest
<chr>
1 FLL
Which 3-letter destination codes start with H?
filter(flights, str_detect(dest, "^H")) |>
select(dest) |>
unique()
# A tibble: 4 × 1
dest
<chr>
1 HOU
2 HNL
3 HDN
4 HYA
Let’s make a new column that combines the origin and dest columns.
mutate(flights, new_col = str_c(origin, ":", dest)) |>
select(new_col, everything())
# A tibble: 253,316 × 12
new_col year month day dep_delay arr_delay carrier origin dest air_time
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl>
1 JFK:LAX 2014 1 1 14 13 AA JFK LAX 359
2 JFK:LAX 2014 1 1 -3 13 AA JFK LAX 363
3 JFK:LAX 2014 1 1 2 9 AA JFK LAX 351
4 LGA:PBI 2014 1 1 -8 -26 AA LGA PBI 157
5 JFK:LAX 2014 1 1 2 1 AA JFK LAX 350
6 EWR:LAX 2014 1 1 4 0 AA EWR LAX 339
7 JFK:LAX 2014 1 1 -2 -18 AA JFK LAX 338
8 JFK:LAX 2014 1 1 -3 -14 AA JFK LAX 356
9 JFK:MIA 2014 1 1 -1 -17 AA JFK MIA 161
10 JFK:SEA 2014 1 1 -2 -14 AA JFK SEA 349
# ℹ 253,306 more rows
# ℹ 2 more variables: distance <dbl>, hour <dbl>
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Denver
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.5.1 tibble_3.2.1 dplyr_1.1.3 readr_2.1.4
loaded via a namespace (and not attached):
[1] bit_4.0.5 jsonlite_1.8.7 compiler_4.3.1 crayon_1.5.2
[5] tidyselect_1.2.0 parallel_4.3.1 jquerylib_0.1.4 yaml_2.3.7
[9] fastmap_1.1.1 R6_2.5.1 generics_0.1.3 knitr_1.45
[13] distill_1.6 bslib_0.5.1 pillar_1.9.0 tzdb_0.4.0
[17] rlang_1.1.2 utf8_1.2.4 cachem_1.0.8 stringi_1.8.1
[21] xfun_0.41 sass_0.4.7 bit64_4.0.5 memoise_2.0.1
[25] cli_3.6.1 withr_2.5.2 magrittr_2.0.3 digest_0.6.33
[29] vroom_1.6.4 rstudioapi_0.15.0 hms_1.1.3 lifecycle_1.0.4
[33] vctrs_0.6.4 downlit_0.4.3 evaluate_0.23 glue_1.6.2
[37] fansi_1.0.5 rmarkdown_2.25 tools_4.3.1 pkgconfig_2.0.3
[41] htmltools_0.5.7
The content of this class borrows heavily from previous tutorials:
R code style guide: http://adv-r.had.co.nz/Style.html
Tutorial organization: https://github.com/sjaganna/molb7910-2019
Other R tutorials: https://github.com/matloff/fasteR https://r4ds.had.co.nz/index.html https://bookdown.org/rdpeng/rprogdatascience/