R Bootcamp - Day 2

Tidy data & tidyr

Matt Taliaferro

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Class 2 outline

  • Become familiar with “tidyverse” suite of packages
  • Introduce readr (Exercise 1)
  • Introduce tidyr (Exercise 2)
  • Tools for data exploration (Exercises 3-5)
  • Learn tidyr functions:
  • pivot_wider (Exercise 6)
  • pivot_longer (Exercise 7)
  • separate (Exercise 8 & 9)
  • unite (Exercise 10)
  • handling missing values

Tidyverse

Tidyverse overview

  • Tidyverse is an opinionated collection of R packages designed for data science.
  • All packages share an underlying design philosophy, grammar, and data structures.
  • Comprised of a few dozen packages - we will focus mainly on tidyr, dplyr, and ggplot2

Tidyverse workflow

What is tidy data?

“Tidy datasets are all alike but every messy dataset is messy in its own way.”

— Hadley Wickham

Tidying exercises

Data import with readr

N.B. note the difference between readr::read_csv() and the built-in read.csv(). Use the former.

finding files in a project with here

The here packages makes it easy to reference files in an R project.

library(here)
here("data/syllabus.tsv")
[1] "/Users/mtaliaferro/Documents/GitHub/molb-7950/data/syllabus.tsv"

Paths identified by here() can be fed to e.g. read_* functions.

read_tsv(here("data/syllabus.tsv"))
# A tibble: 35 × 6
   date                block    topic      instructor   location  title         
   <dttm>              <chr>    <chr>      <chr>        <chr>     <chr>         
 1 2024-08-26 00:00:00 Bootcamp R          Taliaferro   Ed2N 2106 Intro to R & …
 2 2024-08-27 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  Tidy data & t…
 3 2024-08-28 00:00:00 Bootcamp R          Taliaferro   Ed2N 2102 dplyr         
 4 2024-08-29 00:00:00 Bootcamp R          Taliaferro   AHSB 2200 ggplot2       
 5 2024-08-30 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  ggplot2       
 6 2024-09-02 00:00:00 <NA>     <NA>       <NA>         <NA>      NO CLASS: LAB…
 7 2024-09-03 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  tidyverse odd…
 8 2024-09-04 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  putting it al…
 9 2024-09-05 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  putting it al…
10 2024-09-06 00:00:00 Bootcamp Statistics Ramachandran AHSB 2200 Stats intro a…
# ℹ 25 more rows

Data sets for tidying - Exercise 2

  • In this class, we will use the data sets that come with the tidyr package to explore all the functions provided by tidyr.

  • table1, table2, table3, table4a, table4b, and table5 all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000.

  • Let’s take a look at these data sets.

Getting familiar with the data - Exercise 3

R provides many functions to examine features of data:

  • View(): To open the table up in an excel-like interface - not recommended for large tables
  • class(): what kind of object is it (high-level)?
  • typeof(): what is an object data type (low-level)?
  • is_tibble(): use is functions to query data type
  • str(): what is the structure of the object?
  • attributes(): does it have any metadata?

Let’s use these to explore table1

Getting familiar with the data - Exercise 4

Useful functions for tibbles:

  • head() - shows first 6 rows
  • tail() - shows last 6 rows
  • dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
  • nrow() - number of rows
  • ncol() - number of columns
  • names() or colnames() - both show the names attribute for a data frame
  • glimpse()

Use these on table1

Exercise 5 - summary, hist, & table

summary(): A generic function used to produce result summaries of the results of various model fitting functions.

summary(table1) # summary of the whole table
   country               year          cases          population       
 Length:6           Min.   :1999   Min.   :   745   Min.   :1.999e+07  
 Class :character   1st Qu.:1999   1st Qu.: 11434   1st Qu.:5.845e+07  
 Mode  :character   Median :2000   Median : 59112   Median :1.733e+08  
                    Mean   :2000   Mean   : 91277   Mean   :4.901e+08  
                    3rd Qu.:2000   3rd Qu.:179316   3rd Qu.:9.983e+08  
                    Max.   :2000   Max.   :213766   Max.   :1.280e+09  

you can also summarize one variable, e.g. summary(table1$cases)

hist(): takes a vector and plots a histogram.

hist(table1$cases)

table(): uses classifying factors to build a contingency table of the counts for each factor combination.

table(table2$year) # tally of one column

1999 2000 
   6    6 
table(table2$year, table2$type) # two columns at a time
      
       cases population
  1999     3          3
  2000     3          3

Main verbs in tidyr

Tidying data

The four verbs (functions) to keep in mind for reshaping data with tidyr are:

  • pivot_wider()
  • pivot_longer()
  • separate()
  • unite()

There are other verbs as well - look at the tidyr cheatsheet!

pivot function - visualized

::: aside tidyexplain from @gadenbuie :::

pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.

Is the left tibble tidy?

pivot_wider() - Exercise 6

What will the output look like?

table2
# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583
pivot_wider(
  table2,
  names_from = type,
  values_from = count
)
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

If you want to save the output, assign it to a new variable. This new variable will appear in your Environment tab.

table2_tidy <- pivot_wider(
  table2,
  names_from = type,
  values_from = count
)

table2_tidy
# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.

Is the left tibble tidy?

pivot_longer() - Exercise 7

What will the output look like?

table4a
# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766
pivot_longer(
  table4a,
  2:3,
  names_to = "year",
  values_to = "cases"
)
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766
pivot_longer(
  table4a,
  -country,
  names_to = "year",
  values_to = "cases"
)
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766
table4_tidy <- pivot_longer(
  table4a,
  -country,
  names_to = "year",
  values_to = "cases"
)

table4_tidy
# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

separate - syntax

Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

separate() syntax

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+", #any sequence of non-alphanumeric values
  remove = TRUE, # default is to remove the original column
  convert = FALSE, # default is to not convert
  extra = "warn",
  fill = "warn",
  ...
)

separate - Exercise 8

What will the output look like?

table3
# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583
separate(
  table3,
  rate,
  into = c("cases", "population"),
  sep = "/"
)
# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

unite() - syntax

unite() combines multiple columns into a single column.

unite() syntax

unite(
  data, 
  col, 
  ..., # select columns to unite
  sep = "[^[:alnum:]]+", #any sequence of non-alphanumeric values
  remove = TRUE, #default is set to TRUE
  na.rm = FALSE #default is set to FALSE
)

unite - Exercise 9

table6 <- read_csv(
  here("data/bootcamp/table6.csv"),
  show_col_types = FALSE
)

table6
# A tibble: 3 × 5
   ...1 country    century  year  cases
  <dbl> <chr>        <dbl> <dbl>  <dbl>
1     1 Afganistan      19    99    745
2     2 Brazil          19    99  37737
3     3 China           19    99 212258
unite(
  table6,
  "new",
  century,
  year,
  sep = ""
)
# A tibble: 3 × 4
   ...1 country    new    cases
  <dbl> <chr>      <chr>  <dbl>
1     1 Afganistan 1999     745
2     2 Brazil     1999   37737
3     3 China      1999  212258

Handling missing values

Handling missing values

df <- tibble(
  x = c(1, 2, NA),
  y = c("a", NA, "b")
)

df
# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 <NA> 
3    NA b    


drop_na(df)
# A tibble: 1 × 2
      x y    
  <dbl> <chr>
1     1 a    


replace_na(
  df,
  list(x = 0, y = "unknown")
)
# A tibble: 3 × 2
      x y      
  <dbl> <chr>  
1     1 a      
2     2 unknown
3     0 b