Class 4: Reshaping data into a tidy format

Kent Riemondy https://github.com/kriemo (RNA Bioscience Initiative)https://medschool.cuanschutz.edu/rbi
12-05-2023

The Rmarkdown for this class is on github

Goals for today

“Data Scientists spend up to 80% of the time on data cleaning and 20 percent of their time on actual data analysis.” – Exploratory Data Mining and Data Cleaning. Dasu and Johnson

Wide versus long data formats

Data can be represented in multiple formats. Today we will discuss two common tabular formats for organizing data for analysis.

Consider the following dataset, which contains population estimates for countries throughout history. This representation of data is commonly referred to as ‘wide’ data format, which is a matrix-like format containing samples as rows and features as columns, with values associated with each observation of a sample and feature.

library(readr)
pop_wide <- read_csv("data/country_population.csv")
pop_wide
# A tibble: 197 × 302
   country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan… 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6
 2 Angola  1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6
 3 Albania 4   e5 4.02e5 4.04e5 4.05e5 4.07e5 4.09e5 4.11e5 4.13e5 4.14e5 4.16e5
 4 Andorra 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3
 5 UAE     4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4
 6 Argent… 5.34e5 5.20e5 5.06e5 4.92e5 4.79e5 4.66e5 4.53e5 4.41e5 4.29e5 4.17e5
 7 Armenia 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5
 8 Antigu… 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4
 9 Austra… 2   e5 2.05e5 2.11e5 2.16e5 2.22e5 2.27e5 2.33e5 2.39e5 2.46e5 2.52e5
10 Austria 3   e6 3.02e6 3.04e6 3.05e6 3.07e6 3.09e6 3.11e6 3.12e6 3.14e6 3.16e6
# ℹ 187 more rows
# ℹ 291 more variables: `1810` <dbl>, `1811` <dbl>, `1812` <dbl>, `1813` <dbl>,
#   `1814` <dbl>, `1815` <dbl>, `1816` <dbl>, `1817` <dbl>, `1818` <dbl>,
#   `1819` <dbl>, `1820` <dbl>, `1821` <dbl>, `1822` <dbl>, `1823` <dbl>,
#   `1824` <dbl>, `1825` <dbl>, `1826` <dbl>, `1827` <dbl>, `1828` <dbl>,
#   `1829` <dbl>, `1830` <dbl>, `1831` <dbl>, `1832` <dbl>, `1833` <dbl>,
#   `1834` <dbl>, `1835` <dbl>, `1836` <dbl>, `1837` <dbl>, `1838` <dbl>, …

The wide matrix-like format is very useful and a common format used for statistics and machine learning. Matrices can take advantage of optimized numerical routines and are the data representation of mathematical matrices. We will work with matrices later in class, particularly with their use to generate heatmaps.

Representing data in a matrix however has a few practical implications:

  1. There is only 1 type of data stored in a matrix-like representation (e.g. each cell is the same unit of observation, the population per country). To store additional related data types (e.g. the countries GDP each year) you need to place each new value in an independent matrix.

  2. The matrix-like format does not easily lend itself to more complicated summaries. For example, what if we wanted to average the GDP values for each decade or century? We would have to write rather complicated code to parse out subsets of columns for each time period, average them, then merge them into a summary matrix.

Data in a matrix can be instead formatted into a long (also called “tidy”) format.

#> # A tibble: 10 × 3
#>    country     year  population
#>    <chr>       <chr>      <dbl>
#>  1 Afghanistan 1800     3280000
#>  2 Afghanistan 1801     3280000
#>  3 Afghanistan 1802     3280000
#>  4 Afghanistan 1803     3280000
#>  5 Afghanistan 1804     3280000
#>  6 Afghanistan 1805     3280000
#>  7 Afghanistan 1806     3280000
#>  8 Afghanistan 1807     3280000
#>  9 Afghanistan 1808     3280000
#> 10 Afghanistan 1809     3280000

The long format of this data convert the many columns of a matrix into a 3 column data.frame containing 3 variables (country, year, and population).

Tidy data format

“Tidy datasets are all alike, but every messy dataset is messy in its own way.” –– Hadley Wickham

A tidy dataset is structured in a manner to be most effectively processed in R using the tidyverse. For example, with the population dataset, instead of having to provide logic to process 100s of columns, instead there are only 3 columns.

Most data tables that you’ve worked with are probably not tidy. It takes experience to understand the best way to format the data for data processing. As you work more in R and the tidyverse this will become more natural.

Tidy data has the following attributes:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

What is a variable, what is an observation, and what is a value?

  1. A value is a number or word, e.g. the population.

  2. Every value belongs to a variable and an observation, e.g. the population value observed in Austria in the year 1910.

  3. A variable contains all values that measure the same attribute (e.g. height, temperature, duration, magnitude) across units. (e.g. Austria is a value of the country variable, 1910 is a value of the year variable).

  4. An observation contains all values measured on the same unit across attributes (e.g observations about Austria in 1910).

Shown below is a simplified data table in a tidy format, provided by the tidyr package. This data table shows the # of TB cases documented by the WHO in a few countries in the years 1999 and 2000.

library(tidyr)
table1
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

The same data, represented in wide, matrix-like format, would require 2 tables:

e.g a table with the cases values per country.

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

e.g a table with the population values per country

table4b
# A tibble: 3 × 3
  country         `1999`     `2000`
  <chr>            <dbl>      <dbl>
1 Afghanistan   19987071   20595360
2 Brazil       172006362  174504898
3 China       1272915272 1280428583

What advantages does the tidy format provide?

  1. Easy to generate summaries of the data. e.g. via group_by() -> summarize()

  2. Easy to plot the data using the ggplot2 framework (more on that in later classes)

  3. Very easy to join multiple related data frames based on key values.

Some disadvantages:

  1. Not space efficient

  2. Not intuitive

  3. Doesn’t interface well with traditional machine learning and statistical approaches.

Converting between long and wide formats using tidyr

The tidyr package provides functionality to convert datasets into tidy formats.

Reshaping wide data to long

The pivot_longer function requires specifying the columns to pivot using the tidyselect syntax. This syntax is used elsewhere in the tidyverse and is a useful shorthand to avoid listing all columns of interest.

pivot_longer(tbl, cols = <...>)

Tables from tidyr cheatsheet from https://posit.co/wp-content/uploads/2022/10/tidyr.pdf

Figure 1: Tables from tidyr cheatsheet from https://posit.co/wp-content/uploads/2022/10/tidyr.pdf

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
pivot_longer(table4a, cols = `1999`:`2000`) # pivot columns from 1999 -> 2000
# A tibble: 6 × 3
  country     name   value
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766
pivot_longer(table4a, cols = -country) # pivot all columns not matching country
# A tibble: 6 × 3
  country     name   value
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

Let’s try it out on the pop_wide population data

pop_long <- pivot_longer(pop_wide, cols = -country)

pop_long <- pivot_longer(pop_wide, 
                         cols = -country, 
                         names_to = "year",
                         values_to = "population")

Why is the useful? Well now we can quickly use dplyr to answer questions, such as what is the average population per country across all years?

library(dplyr)
group_by(pop_long, country) |> 
  summarize(mean_population = mean(population))
# A tibble: 197 × 2
   country             mean_population
   <chr>                         <dbl>
 1 Afghanistan               28038306.
 2 Albania                    1530495.
 3 Algeria                   23736578.
 4 Andorra                      31687.
 5 Angola                    27240465.
 6 Antigua and Barbuda          58430.
 7 Argentina                 22730847.
 8 Armenia                    1637548.
 9 Australia                 13964223.
10 Austria                    6573422.
# ℹ 187 more rows

Reshaping long data to wide

pivot_wider(tbl, names_from = <...>, values_from = <...>)

names_from: the column whose values will become new columns in the result.
values_from: the column whose values will be in the new columns.

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
pivot_wider(table2, names_from = type, values_from = count)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Try it out with the pop_long population data.

pivot_wider(pop_long, names_from = year, values_from = population)
# A tibble: 197 × 302
   country `1800` `1801` `1802` `1803` `1804` `1805` `1806` `1807` `1808` `1809`
   <chr>    <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghan… 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6 3.28e6
 2 Angola  1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6 1.57e6
 3 Albania 4   e5 4.02e5 4.04e5 4.05e5 4.07e5 4.09e5 4.11e5 4.13e5 4.14e5 4.16e5
 4 Andorra 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3 2.65e3
 5 UAE     4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4 4.02e4
 6 Argent… 5.34e5 5.20e5 5.06e5 4.92e5 4.79e5 4.66e5 4.53e5 4.41e5 4.29e5 4.17e5
 7 Armenia 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5 4.13e5
 8 Antigu… 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4 3.7 e4
 9 Austra… 2   e5 2.05e5 2.11e5 2.16e5 2.22e5 2.27e5 2.33e5 2.39e5 2.46e5 2.52e5
10 Austria 3   e6 3.02e6 3.04e6 3.05e6 3.07e6 3.09e6 3.11e6 3.12e6 3.14e6 3.16e6
# ℹ 187 more rows
# ℹ 291 more variables: `1810` <dbl>, `1811` <dbl>, `1812` <dbl>, `1813` <dbl>,
#   `1814` <dbl>, `1815` <dbl>, `1816` <dbl>, `1817` <dbl>, `1818` <dbl>,
#   `1819` <dbl>, `1820` <dbl>, `1821` <dbl>, `1822` <dbl>, `1823` <dbl>,
#   `1824` <dbl>, `1825` <dbl>, `1826` <dbl>, `1827` <dbl>, `1828` <dbl>,
#   `1829` <dbl>, `1830` <dbl>, `1831` <dbl>, `1832` <dbl>, `1833` <dbl>,
#   `1834` <dbl>, `1835` <dbl>, `1836` <dbl>, `1837` <dbl>, `1838` <dbl>, …

Separate

separate is useful for dealing with data in which a single column contains multiple variables.

seperate(tbl, col = <...>, into = c(<..., ..., ...>), sep = "...")

col: column to split into multiple columns
into: column names of new columns to be generated, supplied as a character vector (use quotes).
sep: the separator used to split values in the col column. Can be a character (_) or a integer to indicate the character position to split (2).

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
separate(table3, col = rate, into = c("cases", "pop"), sep = "/")
# A tibble: 6 × 4
  country      year cases  pop       
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

Exercises

Use the gapminder population dataset (pop_long) to perform the following tasks and answer the following questions:

  1. Which country had the highest population in 1810?
filter(pop_long, year == "1810") |>
  arrange(desc(population)) |>
  head(1) |>
  _$country
[1] "China"
# or
filter(pop_long, year == "1810") |>
  arrange(desc(population)) |>
  head(1) |>
  pull(country)
[1] "China"
  1. What was the world population in the year 1840?
filter(pop_long, year == "1840") |>
  summarize(world_pop = sum(population)) |>
  _$world_pop
[1] 1224001525
  1. Which country had the lowest average population in the 19th century (years 1800-1899)?
filter(pop_long, year %in% as.character(1800:1899)) |>
  group_by(country) |>
  summarize(mean_pop = mean(population)) |>
  arrange(mean_pop) |>
  head(1) |>
  _$country
[1] "Holy See"
# or
library(stringr) # or library(tidyverse)
mutate(pop_long,
       century = as.numeric(str_sub(year, 1, 2)) + 1) |>
  filter(century == 19) |>
  group_by(country) |>
  summarize(mean_pop = mean(population)) |>
  arrange(mean_pop) |>
  head(1) |>
  _$country
[1] "Holy See"

Using binds and joins to aggregate multiple data.frames

column binds

bind_cols(tbl_1, tbl_2, ...)

bind_cols will bind the columns from 2 or more tables into 1 table. Note that with column binds you need to ensure that each table has the same number of rows, and that the rows correspond to the same observations.

library(dplyr)
tbl1 <- data.frame(x = 1:3)
tbl2 <- data.frame(y = 3:5)
bind_cols(tbl1, tbl2)
  x y
1 1 3
2 2 4
3 3 5

row binds

bind_rows binds rows from multiple tables into one table. Similarly to bind_cols you will want the columns to match between the tables, so that the observations are consistent with the variables.

bind_rows(tbl_1, tbl_2, ..., .id = NULL)

df_1 <- data.frame(x = 1:5, y = LETTERS[1:5])
df_2 <- data.frame(x = 11:15, y = LETTERS[6:10])

bind_rows(df_1, df_2)
    x y
1   1 A
2   2 B
3   3 C
4   4 D
5   5 E
6  11 F
7  12 G
8  13 H
9  14 I
10 15 J

You can also use a list of data.frames with bind_rows. If the list is named, you can use the .id argument to store a column specifying the name of the data.frame in the output.

lst_of_dfs <- list(one = df_1,
                   two = df_2)

bind_rows(lst_of_dfs)
    x y
1   1 A
2   2 B
3   3 C
4   4 D
5   5 E
6  11 F
7  12 G
8  13 H
9  14 I
10 15 J
bind_rows(lst_of_dfs, .id = "source_table")
   source_table  x y
1           one  1 A
2           one  2 B
3           one  3 C
4           one  4 D
5           one  5 E
6           two 11 F
7           two 12 G
8           two 13 H
9           two 14 I
10          two 15 J

Joins

Join operations are used to join one table with another table by matching the values shared in particular columns. Join operations enable linking of multiple datasets that contain shared values.

There are multiple way to join two tables, depending on how you want to handle different combinations of values present or missing in two tables.

Assume we have two data.frames called x and y

The following joins add columns from y to x, matching rows based on the matching values in shared columns.

inner_join(x, y): includes all rows in x and y.

left_join(x, y): includes all rows in x.

right_join(x, y): includes all rows in y.

full_join(x, y): includes all rows in x or y.

If a row in x matches multiple rows in y, all the rows in y will be returned once for each matching row in x.

Consider our pop_long data.frame. What if we wanted to add additional variables to the data.frame, such as the estimated GDP?

pop_long[1:5, ]
# A tibble: 5 × 3
  country     year  population
  <chr>       <chr>      <dbl>
1 Afghanistan 1800     3280000
2 Afghanistan 1801     3280000
3 Afghanistan 1802     3280000
4 Afghanistan 1803     3280000
5 Afghanistan 1804     3280000

First we’ll read in an additional dataset from Gapminder that contains GDP estimates per country over time. Note that these datafiles have been preprocessed using code here

# read in and convert to long format
gdp_wide <- read_csv("data/income_per_person.csv")
gdp_long <- pivot_longer(gdp_wide, 
                         -country, 
                         names_to = "year",
                         values_to = "GDP")
gdp_long
# A tibble: 48,945 × 3
   country     year    GDP
   <chr>       <chr> <dbl>
 1 Afghanistan 1799    683
 2 Afghanistan 1800    683
 3 Afghanistan 1801    683
 4 Afghanistan 1802    683
 5 Afghanistan 1803    683
 6 Afghanistan 1804    683
 7 Afghanistan 1805    683
 8 Afghanistan 1806    683
 9 Afghanistan 1807    683
10 Afghanistan 1808    683
# ℹ 48,935 more rows

Now we can use various joins to merge these data.frames into 1 data.frame.

# join on country and year columns, keeping rows with values present in both tables
inner_join(gdp_long, pop_long)
# A tibble: 48,000 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1800    683    3280000
 2 Afghanistan 1801    683    3280000
 3 Afghanistan 1802    683    3280000
 4 Afghanistan 1803    683    3280000
 5 Afghanistan 1804    683    3280000
 6 Afghanistan 1805    683    3280000
 7 Afghanistan 1806    683    3280000
 8 Afghanistan 1807    683    3280000
 9 Afghanistan 1808    683    3280000
10 Afghanistan 1809    684    3280000
# ℹ 47,990 more rows

The Joining, by = join_by(country, year) message indicates that the “country” and “year” columns were used to determine matching rows between the two tables. This is auto-detected based on shared column names in the two data.frames.

You can use the by argument to explicitly specify the columns you’d like to join, which is useful if the columns of interest have different names in the two tables.

# same as above, but being explicit about the columns to use for joining.

# note that for joins you DO need to use quotes for the columns
inner_join(gdp_long, pop_long, by = c("country", "year"))
# A tibble: 48,000 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1800    683    3280000
 2 Afghanistan 1801    683    3280000
 3 Afghanistan 1802    683    3280000
 4 Afghanistan 1803    683    3280000
 5 Afghanistan 1804    683    3280000
 6 Afghanistan 1805    683    3280000
 7 Afghanistan 1806    683    3280000
 8 Afghanistan 1807    683    3280000
 9 Afghanistan 1808    683    3280000
10 Afghanistan 1809    684    3280000
# ℹ 47,990 more rows
# unless you use the `join_by` helper
inner_join(gdp_long, pop_long, by = join_by(country, year))
# A tibble: 48,000 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1800    683    3280000
 2 Afghanistan 1801    683    3280000
 3 Afghanistan 1802    683    3280000
 4 Afghanistan 1803    683    3280000
 5 Afghanistan 1804    683    3280000
 6 Afghanistan 1805    683    3280000
 7 Afghanistan 1806    683    3280000
 8 Afghanistan 1807    683    3280000
 9 Afghanistan 1808    683    3280000
10 Afghanistan 1809    684    3280000
# ℹ 47,990 more rows
# join on country and year columns, keeping values all values from gdp_long data.frame
left_join(gdp_long, pop_long)
# A tibble: 48,945 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1799    683         NA
 2 Afghanistan 1800    683    3280000
 3 Afghanistan 1801    683    3280000
 4 Afghanistan 1802    683    3280000
 5 Afghanistan 1803    683    3280000
 6 Afghanistan 1804    683    3280000
 7 Afghanistan 1805    683    3280000
 8 Afghanistan 1806    683    3280000
 9 Afghanistan 1807    683    3280000
10 Afghanistan 1808    683    3280000
# ℹ 48,935 more rows
# join on country and year columns, keeping values all values from gdp_long and pop_long data.frame
full_join(gdp_long, pop_long)
# A tibble: 60,242 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1799    683         NA
 2 Afghanistan 1800    683    3280000
 3 Afghanistan 1801    683    3280000
 4 Afghanistan 1802    683    3280000
 5 Afghanistan 1803    683    3280000
 6 Afghanistan 1804    683    3280000
 7 Afghanistan 1805    683    3280000
 8 Afghanistan 1806    683    3280000
 9 Afghanistan 1807    683    3280000
10 Afghanistan 1808    683    3280000
# ℹ 60,232 more rows

Missing data

Join operations will often generate missing data (e.g. NA values).

Zeroes, NA, NaN and NULL

Let’s examine the output from the full_join() operation above which generated NA values.

country_stats <- full_join(gdp_long, pop_long)
country_stats
# A tibble: 60,242 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1799    683         NA
 2 Afghanistan 1800    683    3280000
 3 Afghanistan 1801    683    3280000
 4 Afghanistan 1802    683    3280000
 5 Afghanistan 1803    683    3280000
 6 Afghanistan 1804    683    3280000
 7 Afghanistan 1805    683    3280000
 8 Afghanistan 1806    683    3280000
 9 Afghanistan 1807    683    3280000
10 Afghanistan 1808    683    3280000
# ℹ 60,232 more rows

Quick check for NA values

sum(is.na(country_stats))
[1] 12342
any(is.na(country_stats))
[1] TRUE

filter with is.na()

You can identify variables with NA values by combining filter() and is.na().

# find rows where GDP is NA
filter(country_stats, is.na(GDP))

# find rows where GDP is *not* NA
filter(country_stats, !is.na(GDP))

na.omit()

You can remove all rows containing NA values with na.omit().

na.omit(country_stats)

Computing with NA values

Instead of removing NA values we can instead just exclude NA values from operations with a common optional argument na.rm = TRUE.

x <- c(1, NA, 3)
sum(x)
sum(x, na.rm = TRUE)

# if NAs are present, the result is NA
sum(country_stats$GDP)

# solution: exclude NAs from the calculation
sum(country_stats$GDP, na.rm = TRUE)
group_by(country_stats, country) %>% 
  summarize(avg_GDP = mean(GDP, na.rm = TRUE))

Also you can remove NaN values by detecting for their presence using is.nan(). These values often occur when a summary operation (e.g. mean or sum) is performed on a vector with 0 elements.

x <- 1:10
# none are TRUE
x <- x[x > 100]
x
integer(0)
length(x)
[1] 0
mean(x)
[1] NaN
mean(c(1, NaN), na.rm = TRUE)
[1] 1

Replacing NA values

Let’s replace the NA values in the population column with a number, such as -1234.

This is an operation that is easy to do with base R [] approach.

# use is.na to identify NA values to replace with -1234
country_stats$population[is.na(country_stats$population)] <- -1234

country_stats[1:10, ]
# A tibble: 10 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1799    683      -1234
 2 Afghanistan 1800    683    3280000
 3 Afghanistan 1801    683    3280000
 4 Afghanistan 1802    683    3280000
 5 Afghanistan 1803    683    3280000
 6 Afghanistan 1804    683    3280000
 7 Afghanistan 1805    683    3280000
 8 Afghanistan 1806    683    3280000
 9 Afghanistan 1807    683    3280000
10 Afghanistan 1808    683    3280000

Alternatively you can use the ifelse() base R function.

x <- 1:10

ifelse(x < 5, # an expression producing a logical vector 
       5,     # if TRUE, replace with this expression
       x)     # if FALSE replace with this expression
 [1]  5  5  5  5  5  6  7  8  9 10

Replace -1234 with NA using base R $ notation to identify columns.

country_stats$population <- ifelse(country_stats$population == -1234,
                                   NA,
                                   country_stats$population)
country_stats[1:10, ]
# A tibble: 10 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1799    683         NA
 2 Afghanistan 1800    683    3280000
 3 Afghanistan 1801    683    3280000
 4 Afghanistan 1802    683    3280000
 5 Afghanistan 1803    683    3280000
 6 Afghanistan 1804    683    3280000
 7 Afghanistan 1805    683    3280000
 8 Afghanistan 1806    683    3280000
 9 Afghanistan 1807    683    3280000
10 Afghanistan 1808    683    3280000

The same can also be done with dplyr, in this case replacing NA values again with -1234.

mutate(country_stats, 
       population = ifelse(is.na(population), 
                           -1234,
                           population)) 
# A tibble: 60,242 × 4
   country     year    GDP population
   <chr>       <chr> <dbl>      <dbl>
 1 Afghanistan 1799    683      -1234
 2 Afghanistan 1800    683    3280000
 3 Afghanistan 1801    683    3280000
 4 Afghanistan 1802    683    3280000
 5 Afghanistan 1803    683    3280000
 6 Afghanistan 1804    683    3280000
 7 Afghanistan 1805    683    3280000
 8 Afghanistan 1806    683    3280000
 9 Afghanistan 1807    683    3280000
10 Afghanistan 1808    683    3280000
# ℹ 60,232 more rows

case_when()

If you want to perform more complex operations use case_when() from dplyr. case_when() is equivalent to performing multiple nested ifelse() operations, whereby if the first operation is not TRUE, then check for the second condition, repeating for each condition until there are no more statements.

the syntax for case when is :

`case_when(conditional statement ~ "value in result if TRUE",
           conditional statement #2 ~ "another value in result if",
           TRUE ~ "default if neither conditional statement 1 or 2 are TRUE")`

For a use case, imagine that we wanted to add a new column called era, which signified if the year was in the past, present or future.

country_stats |>
  mutate(
    era = case_when(year < 2023 ~ "past",
                    year == 2023 ~ "present",
                    year > 2023 ~ "future")
    )
# A tibble: 60,242 × 5
   country     year    GDP population era  
   <chr>       <chr> <dbl>      <dbl> <chr>
 1 Afghanistan 1799    683         NA past 
 2 Afghanistan 1800    683    3280000 past 
 3 Afghanistan 1801    683    3280000 past 
 4 Afghanistan 1802    683    3280000 past 
 5 Afghanistan 1803    683    3280000 past 
 6 Afghanistan 1804    683    3280000 past 
 7 Afghanistan 1805    683    3280000 past 
 8 Afghanistan 1806    683    3280000 past 
 9 Afghanistan 1807    683    3280000 past 
10 Afghanistan 1808    683    3280000 past 
# ℹ 60,232 more rows
# same as above, using TRUE on the left side provides a default value.
country_stats |>
  mutate(
    era = case_when(year < 2023 ~ "past",
                    year == 2023 ~ "present",
                    TRUE ~ "future")
    ) 
# A tibble: 60,242 × 5
   country     year    GDP population era  
   <chr>       <chr> <dbl>      <dbl> <chr>
 1 Afghanistan 1799    683         NA past 
 2 Afghanistan 1800    683    3280000 past 
 3 Afghanistan 1801    683    3280000 past 
 4 Afghanistan 1802    683    3280000 past 
 5 Afghanistan 1803    683    3280000 past 
 6 Afghanistan 1804    683    3280000 past 
 7 Afghanistan 1805    683    3280000 past 
 8 Afghanistan 1806    683    3280000 past 
 9 Afghanistan 1807    683    3280000 past 
10 Afghanistan 1808    683    3280000 past 
# ℹ 60,232 more rows
Show session info
R version 4.3.1 (2023-06-16)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.2.1

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Denver
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] stringr_1.5.1 dplyr_1.1.4   tidyr_1.3.0   readr_2.1.4  

loaded via a namespace (and not attached):
 [1] bit_4.0.5         jsonlite_1.8.8    compiler_4.3.1    highr_0.10       
 [5] crayon_1.5.2      tidyselect_1.2.0  parallel_4.3.1    jquerylib_0.1.4  
 [9] yaml_2.3.7        fastmap_1.1.1     R6_2.5.1          generics_0.1.3   
[13] knitr_1.45        tibble_3.2.1      distill_1.6       bslib_0.6.1      
[17] pillar_1.9.0      tzdb_0.4.0        rlang_1.1.2       utf8_1.2.4       
[21] stringi_1.8.2     cachem_1.0.8      xfun_0.41         sass_0.4.7       
[25] bit64_4.0.5       memoise_2.0.1     cli_3.6.1         withr_2.5.2      
[29] magrittr_2.0.3    digest_0.6.33     vroom_1.6.4       rstudioapi_0.15.0
[33] hms_1.1.3         lifecycle_1.0.4   vctrs_0.6.5       downlit_0.4.3    
[37] evaluate_0.23     glue_1.6.2        fansi_1.0.5       purrr_1.0.2      
[41] rmarkdown_2.25    tools_4.3.1       pkgconfig_2.0.3   htmltools_0.5.7  

Acknowledgements and additional references

The content of this class borrows heavily from previous tutorials:

Tutorial organization: https://github.com/sjaganna/molb7910-2019

R tutorials and documentation: https://github.com/tidyverse/dplyr
https://r4ds.had.co.nz/index.html