The Rmarkdown for this class is on github
tidyr
package for “tidying” rectangular datadplyr
“Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis.” – Exploratory Data Mining and Data Cleaning. Dasu and Johnson
Data can be represented in multiple formats. Today we will discuss two common tabular formats for organizing data for analysis.
Consider the following dataset, which contains population estimates for countries throughout history. This representation of data is commonly referred to as ‘wide’ data format, which is a matrix-like format containing samples as rows and features as columns, with values associated with each observation of a sample and feature.
# A tibble: 197 × 302
country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghan… 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6
2 Angola 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6
3 Albania 4 e5 4.02e5 4.04e5 4.05e5 4.07e5 4.09e5 4.11e5 4.13e5 4.14e5 4.16e5
4 Andorra 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3
5 UAE 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4
6 Argent… 5.34e5 5.20e5 5.06e5 4.92e5 4.79e5 4.66e5 4.53e5 4.41e5 4.29e5 4.17e5
7 Armenia 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5
8 Antigu… 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4
9 Austra… 2 e5 2.05e5 2.11e5 2.16e5 2.22e5 2.27e5 2.33e5 2.39e5 2.46e5 2.52e5
10 Austria 3 e6 3.02e6 3.04e6 3.05e6 3.07e6 3.09e6 3.11e6 3.12e6 3.14e6 3.16e6
# ℹ 187 more rows
# ℹ 291 more variables: `1810` <dbl>, `1811` <dbl>, `1812` <dbl>, `1813` <dbl>,
# `1814` <dbl>, `1815` <dbl>, `1816` <dbl>, `1817` <dbl>, `1818` <dbl>,
# `1819` <dbl>, `1820` <dbl>, `1821` <dbl>, `1822` <dbl>, `1823` <dbl>,
# `1824` <dbl>, `1825` <dbl>, `1826` <dbl>, `1827` <dbl>, `1828` <dbl>,
# `1829` <dbl>, `1830` <dbl>, `1831` <dbl>, `1832` <dbl>, `1833` <dbl>,
# `1834` <dbl>, `1835` <dbl>, `1836` <dbl>, `1837` <dbl>, `1838` <dbl>, …
The wide matrix-like format is very useful and a common format used for statistics and machine learning. Matrices can take advantage of optimized numerical routines and are the data representation of mathematical matrices. We will work with matrices later in class, particularly with their use to generate heatmaps.
Representing data in a matrix however has a few practical implications:
There is only 1 type of data stored in a matrix-like representation (e.g. each cell is the same unit of observation, the population per country). To store additional related data types (e.g. the countries GDP each year) you need to place each new value in an independent matrix.
The matrix-like format does not easily lend itself to more complicated summaries. For example, what if we wanted to average the GDP values for each decade or century? We would have to write rather complicated code to parse out subsets of columns for each time period, average them, then merge them into a summary matrix.
Data in a matrix can be instead formatted into a long (also called “tidy”) format.
#> # A tibble: 10 × 3
#> country year population
#> <chr> <chr> <dbl>
#> 1 Afghanistan 1800 3280000
#> 2 Afghanistan 1801 3280000
#> 3 Afghanistan 1802 3280000
#> 4 Afghanistan 1803 3280000
#> 5 Afghanistan 1804 3280000
#> 6 Afghanistan 1805 3280000
#> 7 Afghanistan 1806 3280000
#> 8 Afghanistan 1807 3280000
#> 9 Afghanistan 1808 3280000
#> 10 Afghanistan 1809 3280000
The long format of this data convert the many columns of a matrix into a 3 column data.frame containing 3 variables (country
, year
, and population
).
“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham
A tidy dataset is structured in a manner to be most effectively processed in R using the tidyverse. For example, with the population dataset, instead of having to provide logic to process 100s of columns, instead there are only 3 columns.
Most data tables that you’ve worked with are probably not tidy. It takes experience to understand the best way to format the data for data processing. As you work more in R and the tidyverse this will become more natural.
Tidy data has the following attributes:
What is a variable, what is an observation, and what is a value?
A value is a number or word, e.g. the population.
Every value belongs to a variable and an observation, e.g. the population value observed in Austria in the year 1910.
A variable contains all values that measure the same attribute (e.g. height, temperature, duration, magnitude) across units. (e.g. Austria is a value of the country variable, 1910 is a value of the year variable).
An observation contains all values measured on the same unit across attributes (e.g observations about Austria in 1910).
Shown below is a simplified data table in a tidy format, provided by the tidyr
package. This data table shows the # of TB cases documented by the WHO in a few countries in the years 1999 and 2000.
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
The same data, represented in wide, matrix-like format, would require 2 tables:
e.g a table with the cases
values per country.
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
e.g a table with the population
values per country
table4b
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 19987071 20595360
2 Brazil 172006362 174504898
3 China 1272915272 1280428583
What advantages does the tidy format provide?
Easy to generate summaries of the data.
e.g. via group_by()
-> summarize()
Easy to plot the data using the ggplot2 framework (more on that in later classes)
Very easy to join multiple related data frames based on key values.
Some disadvantages:
Not space efficient
Not intuitive
Doesn’t interface well with traditional machine learning and statistical approaches.
The tidyr
package provides functionality to convert datasets into tidy formats.
pivot_longer()
: convert wide data to long datapivot_wider()
: convert long data to wide dataseparate()
: split a single column into multiple columnsThe pivot_longer
function requires specifying the columns to pivot using the tidyselect
syntax. This syntax is used elsewhere in the tidyverse and is a useful shorthand to avoid listing all columns of interest.
pivot_longer(tbl, cols = <...>)
table4a
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
pivot_longer(table4a, cols = `1999`:`2000`) # pivot columns from 1999 -> 2000
# A tibble: 6 × 3
country name value
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
pivot_longer(table4a, cols = -country) # pivot all columns not matching country
# A tibble: 6 × 3
country name value
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
Let’s try it out on the pop_wide
population data
pop_long <- pivot_longer(pop_wide, cols = -country)
pop_long <- pivot_longer(pop_wide,
cols = -country,
names_to = "year",
values_to = "population")
Why is the useful? Well now we can quickly use dplyr
to answer questions, such
as what is the average population per country across all years?
# A tibble: 197 × 2
country mean_population
<chr> <dbl>
1 Afghanistan 28038306.
2 Albania 1530495.
3 Algeria 23736578.
4 Andorra 31687.
5 Angola 27240465.
6 Antigua and Barbuda 58430.
7 Argentina 22730847.
8 Armenia 1637548.
9 Australia 13964223.
10 Austria 6573422.
# ℹ 187 more rows
pivot_wider(tbl, names_from = <...>, values_from = <...>)
names_from
: the column whose values will become new columns in the result.
values_from
: the column whose values will be in the new columns.
table2
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
pivot_wider(table2, names_from = type, values_from = count)
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Try it out with the pop_long
population data.
pivot_wider(pop_long, names_from = year, values_from = population)
# A tibble: 197 × 302
country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Afghan… 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6
2 Angola 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6
3 Albania 4 e5 4.02e5 4.04e5 4.05e5 4.07e5 4.09e5 4.11e5 4.13e5 4.14e5 4.16e5
4 Andorra 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3
5 UAE 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4
6 Argent… 5.34e5 5.20e5 5.06e5 4.92e5 4.79e5 4.66e5 4.53e5 4.41e5 4.29e5 4.17e5
7 Armenia 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5
8 Antigu… 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4
9 Austra… 2 e5 2.05e5 2.11e5 2.16e5 2.22e5 2.27e5 2.33e5 2.39e5 2.46e5 2.52e5
10 Austria 3 e6 3.02e6 3.04e6 3.05e6 3.07e6 3.09e6 3.11e6 3.12e6 3.14e6 3.16e6
# ℹ 187 more rows
# ℹ 291 more variables: `1810` <dbl>, `1811` <dbl>, `1812` <dbl>, `1813` <dbl>,
# `1814` <dbl>, `1815` <dbl>, `1816` <dbl>, `1817` <dbl>, `1818` <dbl>,
# `1819` <dbl>, `1820` <dbl>, `1821` <dbl>, `1822` <dbl>, `1823` <dbl>,
# `1824` <dbl>, `1825` <dbl>, `1826` <dbl>, `1827` <dbl>, `1828` <dbl>,
# `1829` <dbl>, `1830` <dbl>, `1831` <dbl>, `1832` <dbl>, `1833` <dbl>,
# `1834` <dbl>, `1835` <dbl>, `1836` <dbl>, `1837` <dbl>, `1838` <dbl>, …
separate
is useful for dealing with data in which a single column contains multiple variables.
seperate(tbl, col = <...>, into = c(<..., ..., ...>), sep = "...")
col
: column to split into multiple columns
into
: column names of new columns to be generated, supplied as a character vector (use quotes).
sep
: the separator used to split values in the col
column. Can be a character (_
) or a integer to indicate the character position to split (2).
table3
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
# A tibble: 6 × 4
country year cases pop
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
Use the gapminder population dataset (pop_long
) to perform the following tasks and answer the following questions:
[1] "China"
[1] "China"
[1] 1224001525
filter(pop_long, year %in% as.character(1800:1899)) |>
group_by(country) |>
summarize(mean_pop = mean(population)) |>
arrange(mean_pop) |>
head(1) |>
_$country
[1] "Holy See"
# or
library(stringr) # or library(tidyverse)
mutate(pop_long,
century = as.numeric(str_sub(year, 1, 2)) + 1) |>
filter(century == 19) |>
group_by(country) |>
summarize(mean_pop = mean(population)) |>
arrange(mean_pop) |>
head(1) |>
_$country
[1] "Holy See"
bind_cols(tbl_1, tbl_2, ...)
bind_cols
will bind the columns from 2 or more tables into 1 table. Note that with column binds you need to ensure that each table has the same number of rows, and that the rows correspond to the same observations.
library(dplyr)
tbl1 <- data.frame(x = 1:3)
tbl2 <- data.frame(y = 3:5)
bind_cols(tbl1, tbl2)
x y
1 1 3
2 2 4
3 3 5
bind_rows
binds rows from multiple tables into one table. Similarly to bind_cols
you will want the columns to match between the tables, so that the observations are consistent with the variables.
bind_rows(tbl_1, tbl_2, ..., .id = NULL)
df_1 <- data.frame(x = 1:5, y = LETTERS[1:5])
df_2 <- data.frame(x = 11:15, y = LETTERS[6:10])
bind_rows(df_1, df_2)
x y
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 11 F
7 12 G
8 13 H
9 14 I
10 15 J
You can also use a list of data.frames
with bind_rows
. If the list is named, you can use the .id
argument to store a column specifying the name of the data.frame in the output.
x y
1 1 A
2 2 B
3 3 C
4 4 D
5 5 E
6 11 F
7 12 G
8 13 H
9 14 I
10 15 J
bind_rows(lst_of_dfs, .id = "source_table")
source_table x y
1 one 1 A
2 one 2 B
3 one 3 C
4 one 4 D
5 one 5 E
6 two 11 F
7 two 12 G
8 two 13 H
9 two 14 I
10 two 15 J
Join operations are used to join one table with another table by matching the values shared in particular columns. Join operations enable linking of multiple datasets that contain shared values.
There are multiple way to join two tables, depending on how you want to handle different combinations of values present or missing in two tables.
Assume we have two data.frames called x and y
The following joins add columns from y to x, matching rows based on the matching values in shared columns.
inner_join(x, y)
: includes all rows in x and y.
left_join(x, y)
: includes all rows in x.
right_join(x, y)
: includes all rows in y.
full_join(x, y)
: includes all rows in x or y.
If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.
Consider our pop_long
data.frame. What if we wanted to add additional variables to the data.frame, such as the estimated GDP?
pop_long[1:5, ]
# A tibble: 5 × 3
country year population
<chr> <chr> <dbl>
1 Afghanistan 1800 3280000
2 Afghanistan 1801 3280000
3 Afghanistan 1802 3280000
4 Afghanistan 1803 3280000
5 Afghanistan 1804 3280000
First we’ll read in an additional dataset from Gapminder
that contains GDP estimates per country over time. Note that these datafiles have been preprocessed using code here
# read in and convert to long format
gdp_wide <- read_csv("data/income_per_person.csv")
gdp_long <- pivot_longer(gdp_wide,
-country,
names_to = "year",
values_to = "GDP")
gdp_long
# A tibble: 48,945 × 3
country year GDP
<chr> <chr> <dbl>
1 Afghanistan 1799 683
2 Afghanistan 1800 683
3 Afghanistan 1801 683
4 Afghanistan 1802 683
5 Afghanistan 1803 683
6 Afghanistan 1804 683
7 Afghanistan 1805 683
8 Afghanistan 1806 683
9 Afghanistan 1807 683
10 Afghanistan 1808 683
# ℹ 48,935 more rows
Now we can use various joins to merge these data.frames into 1 data.frame.
# join on country and year columns, keeping rows with values present in both tables
inner_join(gdp_long, pop_long)
# A tibble: 48,000 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1800 683 3280000
2 Afghanistan 1801 683 3280000
3 Afghanistan 1802 683 3280000
4 Afghanistan 1803 683 3280000
5 Afghanistan 1804 683 3280000
6 Afghanistan 1805 683 3280000
7 Afghanistan 1806 683 3280000
8 Afghanistan 1807 683 3280000
9 Afghanistan 1808 683 3280000
10 Afghanistan 1809 684 3280000
# ℹ 47,990 more rows
The Joining, by = join_by(country, year)
message indicates that the “country” and “year” columns were used to determine matching rows between the two tables. This is auto-detected based on shared column names in the two data.frames.
You can use the by
argument to explicitly specify the columns you’d like to join, which is useful if the columns of interest have different names in the two tables.
# same as above, but being explicit about the columns to use for joining.
# note that for joins you DO need to use quotes for the columns
inner_join(gdp_long, pop_long, by = c("country", "year"))
# A tibble: 48,000 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1800 683 3280000
2 Afghanistan 1801 683 3280000
3 Afghanistan 1802 683 3280000
4 Afghanistan 1803 683 3280000
5 Afghanistan 1804 683 3280000
6 Afghanistan 1805 683 3280000
7 Afghanistan 1806 683 3280000
8 Afghanistan 1807 683 3280000
9 Afghanistan 1808 683 3280000
10 Afghanistan 1809 684 3280000
# ℹ 47,990 more rows
# unless you use the `join_by` helper
inner_join(gdp_long, pop_long, by = join_by(country, year))
# A tibble: 48,000 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1800 683 3280000
2 Afghanistan 1801 683 3280000
3 Afghanistan 1802 683 3280000
4 Afghanistan 1803 683 3280000
5 Afghanistan 1804 683 3280000
6 Afghanistan 1805 683 3280000
7 Afghanistan 1806 683 3280000
8 Afghanistan 1807 683 3280000
9 Afghanistan 1808 683 3280000
10 Afghanistan 1809 684 3280000
# ℹ 47,990 more rows
# join on country and year columns, keeping values all values from gdp_long data.frame
left_join(gdp_long, pop_long)
# A tibble: 48,945 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1799 683 NA
2 Afghanistan 1800 683 3280000
3 Afghanistan 1801 683 3280000
4 Afghanistan 1802 683 3280000
5 Afghanistan 1803 683 3280000
6 Afghanistan 1804 683 3280000
7 Afghanistan 1805 683 3280000
8 Afghanistan 1806 683 3280000
9 Afghanistan 1807 683 3280000
10 Afghanistan 1808 683 3280000
# ℹ 48,935 more rows
# join on country and year columns, keeping values all values from gdp_long and pop_long data.frame
full_join(gdp_long, pop_long)
# A tibble: 60,242 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1799 683 NA
2 Afghanistan 1800 683 3280000
3 Afghanistan 1801 683 3280000
4 Afghanistan 1802 683 3280000
5 Afghanistan 1803 683 3280000
6 Afghanistan 1804 683 3280000
7 Afghanistan 1805 683 3280000
8 Afghanistan 1806 683 3280000
9 Afghanistan 1807 683 3280000
10 Afghanistan 1808 683 3280000
# ℹ 60,232 more rows
Join operations will often generate missing data (e.g. NA
values).
NA
, NaN
and NULL
Don’t use use zeroes to represent missing data. 0
is valid observed value.
NA
(Not Available) is most often use to represent missing data.
NaN
(Not a Number) is the result of an undefined operation, e.g. 0 / 0
.
NULL
means “undefined” and is only used in a programming context (i.e., a function that returns NULL
). You can’t put NULL
values in a data frame.
Let’s examine the output from the full_join()
operation above which generated NA values.
country_stats <- full_join(gdp_long, pop_long)
country_stats
# A tibble: 60,242 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1799 683 NA
2 Afghanistan 1800 683 3280000
3 Afghanistan 1801 683 3280000
4 Afghanistan 1802 683 3280000
5 Afghanistan 1803 683 3280000
6 Afghanistan 1804 683 3280000
7 Afghanistan 1805 683 3280000
8 Afghanistan 1806 683 3280000
9 Afghanistan 1807 683 3280000
10 Afghanistan 1808 683 3280000
# ℹ 60,232 more rows
NA
valuesfilter
with is.na()
You can identify variables with NA
values by combining filter()
and is.na()
.
na.omit()
You can remove all rows containing NA
values with na.omit()
.
na.omit(country_stats)
NA
valuesInstead of removing NA
values we can instead just exclude NA
values from operations with a common optional argument na.rm = TRUE
.
Also you can remove NaN
values by detecting for their presence using is.nan()
. These values often occur when a summary operation (e.g. mean or sum) is performed on a vector with 0 elements.
NA
valuesLet’s replace the NA values in the population
column with a number, such as -1234.
This is an operation that is easy to do with base R []
approach.
# use is.na to identify NA values to replace with -1234
country_stats$population[is.na(country_stats$population)] <- -1234
country_stats[1:10, ]
# A tibble: 10 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1799 683 -1234
2 Afghanistan 1800 683 3280000
3 Afghanistan 1801 683 3280000
4 Afghanistan 1802 683 3280000
5 Afghanistan 1803 683 3280000
6 Afghanistan 1804 683 3280000
7 Afghanistan 1805 683 3280000
8 Afghanistan 1806 683 3280000
9 Afghanistan 1807 683 3280000
10 Afghanistan 1808 683 3280000
Alternatively you can use the ifelse()
base R function.
x <- 1:10
ifelse(x < 5, # an expression producing a logical vector
5, # if TRUE, replace with this expression
x) # if FALSE replace with this expression
[1] 5 5 5 5 5 6 7 8 9 10
Replace -1234
with NA
using base R $
notation to identify columns.
country_stats$population <- ifelse(country_stats$population == -1234,
NA,
country_stats$population)
country_stats[1:10, ]
# A tibble: 10 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1799 683 NA
2 Afghanistan 1800 683 3280000
3 Afghanistan 1801 683 3280000
4 Afghanistan 1802 683 3280000
5 Afghanistan 1803 683 3280000
6 Afghanistan 1804 683 3280000
7 Afghanistan 1805 683 3280000
8 Afghanistan 1806 683 3280000
9 Afghanistan 1807 683 3280000
10 Afghanistan 1808 683 3280000
The same can also be done with dplyr, in this case replacing NA
values again with -1234.
# A tibble: 60,242 × 4
country year GDP population
<chr> <chr> <dbl> <dbl>
1 Afghanistan 1799 683 -1234
2 Afghanistan 1800 683 3280000
3 Afghanistan 1801 683 3280000
4 Afghanistan 1802 683 3280000
5 Afghanistan 1803 683 3280000
6 Afghanistan 1804 683 3280000
7 Afghanistan 1805 683 3280000
8 Afghanistan 1806 683 3280000
9 Afghanistan 1807 683 3280000
10 Afghanistan 1808 683 3280000
# ℹ 60,232 more rows
If you want to perform more complex operations use case_when()
from dplyr. case_when()
is equivalent to performing multiple nested ifelse()
operations, whereby if the first operation is not TRUE, then check for the second condition, repeating for each condition until there are no more statements.
the syntax for case when is :
`case_when(conditional statement ~ "value in result if TRUE",
conditional statement #2 ~ "another value in result if",
TRUE ~ "default if neither conditional statement 1 or 2 are TRUE")`
For a use case, imagine that we wanted to add a new column called era
, which signified if the year was in the past
, present
or future.
country_stats |>
mutate(
era = case_when(year < 2023 ~ "past",
year == 2023 ~ "present",
year > 2023 ~ "future")
)
# A tibble: 60,242 × 5
country year GDP population era
<chr> <chr> <dbl> <dbl> <chr>
1 Afghanistan 1799 683 NA past
2 Afghanistan 1800 683 3280000 past
3 Afghanistan 1801 683 3280000 past
4 Afghanistan 1802 683 3280000 past
5 Afghanistan 1803 683 3280000 past
6 Afghanistan 1804 683 3280000 past
7 Afghanistan 1805 683 3280000 past
8 Afghanistan 1806 683 3280000 past
9 Afghanistan 1807 683 3280000 past
10 Afghanistan 1808 683 3280000 past
# ℹ 60,232 more rows
# same as above, using TRUE on the left side provides a default value.
country_stats |>
mutate(
era = case_when(year < 2023 ~ "past",
year == 2023 ~ "present",
TRUE ~ "future")
)
# A tibble: 60,242 × 5
country year GDP population era
<chr> <chr> <dbl> <dbl> <chr>
1 Afghanistan 1799 683 NA past
2 Afghanistan 1800 683 3280000 past
3 Afghanistan 1801 683 3280000 past
4 Afghanistan 1802 683 3280000 past
5 Afghanistan 1803 683 3280000 past
6 Afghanistan 1804 683 3280000 past
7 Afghanistan 1805 683 3280000 past
8 Afghanistan 1806 683 3280000 past
9 Afghanistan 1807 683 3280000 past
10 Afghanistan 1808 683 3280000 past
# ℹ 60,232 more rows
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.2.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Denver
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] stringr_1.5.1 dplyr_1.1.4 tidyr_1.3.0 readr_2.1.4
loaded via a namespace (and not attached):
[1] bit_4.0.5 jsonlite_1.8.8 compiler_4.3.1 highr_0.10
[5] crayon_1.5.2 tidyselect_1.2.0 parallel_4.3.1 jquerylib_0.1.4
[9] yaml_2.3.7 fastmap_1.1.1 R6_2.5.1 generics_0.1.3
[13] knitr_1.45 tibble_3.2.1 distill_1.6 bslib_0.6.1
[17] pillar_1.9.0 tzdb_0.4.0 rlang_1.1.2 utf8_1.2.4
[21] stringi_1.8.2 cachem_1.0.8 xfun_0.41 sass_0.4.7
[25] bit64_4.0.5 memoise_2.0.1 cli_3.6.1 withr_2.5.2
[29] magrittr_2.0.3 digest_0.6.33 vroom_1.6.4 rstudioapi_0.15.0
[33] hms_1.1.3 lifecycle_1.0.4 vctrs_0.6.5 downlit_0.4.3
[37] evaluate_0.23 glue_1.6.2 fansi_1.0.5 purrr_1.0.2
[41] rmarkdown_2.25 tools_4.3.1 pkgconfig_2.0.3 htmltools_0.5.7
The content of this class borrows heavily from previous tutorials:
Tutorial organization: https://github.com/sjaganna/molb7910-2019
R tutorials and documentation:
https://github.com/tidyverse/dplyr
https://r4ds.had.co.nz/index.html