R Bootcamp - Day 2

Tidy data & tidyr

Matt Taliaferro

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Class 2 outline

Become familiar with “tidyverse” suite of packages
Introduce readr (Exercise 1)
Introduce tidyr (Exercise 2)
Tools for data exploration (Exercises 3-5)

Learn tidyr functions:
pivot_wider (Exercise 6)
pivot_longer (Exercise 7)
separate (Exercise 8 & 9)
unite (Exercise 10)
handling missing values

Tidyverse

Tidyverse overview

Tidyverse is an opinionated collection of R packages designed for data science.
All packages share an underlying design philosophy, grammar, and data structures.
Comprised of a few dozen packages - we will focus mainly on tidyr, dplyr, and ggplot2

Tidyverse workflow

What is tidy data?

“Tidy datasets are all alike but every messy dataset is messy in its own way.”

— Hadley Wickham

Tidying exercises

Data import with readr

N.B. note the difference between readr::read_csv() and the built-in read.csv(). Use the former.

finding files in a project with `here`

The here packages makes it easy to reference files in an R project.

library(here)
here("data/syllabus.tsv")

[1] "/Users/mtaliaferro/Documents/GitHub/molb-7950/data/syllabus.tsv"

Paths identified by `here()` can be fed to e.g. `read_*` functions.

read_tsv(here("data/syllabus.tsv"))

# A tibble: 35 × 6
   date                block    topic      instructor   location  title         
   <dttm>              <chr>    <chr>      <chr>        <chr>     <chr>         
 1 2024-08-26 00:00:00 Bootcamp R          Taliaferro   Ed2N 2106 Intro to R & …
 2 2024-08-27 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  Tidy data & t…
 3 2024-08-28 00:00:00 Bootcamp R          Taliaferro   Ed2N 2102 dplyr         
 4 2024-08-29 00:00:00 Bootcamp R          Taliaferro   AHSB 2200 ggplot2       
 5 2024-08-30 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  ggplot2       
 6 2024-09-02 00:00:00 <NA>     <NA>       <NA>         <NA>      NO CLASS: LAB…
 7 2024-09-03 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  tidyverse odd…
 8 2024-09-04 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  putting it al…
 9 2024-09-05 00:00:00 Bootcamp R          Taliaferro   Ed1 1400  putting it al…
10 2024-09-06 00:00:00 Bootcamp Statistics Ramachandran AHSB 2200 Stats intro a…
# ℹ 25 more rows

Data sets for tidying - Exercise 2

In this class, we will use the data sets that come with the tidyr package to explore all the functions provided by tidyr.
table1, table2, table3, table4a, table4b, and table5 all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China between 1999 and 2000.
Let’s take a look at these data sets.

Getting familiar with the data - Exercise 3

R provides many functions to examine features of data:

View(): To open the table up in an excel-like interface - not recommended for large tables
class(): what kind of object is it (high-level)?
typeof(): what is an object data type (low-level)?
is_tibble(): use is functions to query data type
str(): what is the structure of the object?
attributes(): does it have any metadata?

Let’s use these to explore table1

Getting familiar with the data - Exercise 4

Useful functions for tibbles:

head() - shows first 6 rows
tail() - shows last 6 rows
dim() - returns the dimensions of data frame (i.e. number of rows and number of columns)
nrow() - number of rows
ncol() - number of columns
names() or colnames() - both show the names attribute for a data frame
glimpse()

Use these on table1

Exercise 5 - summary, hist, & table

summary(): A generic function used to produce result summaries of the results of various model fitting functions.

summary(table1) # summary of the whole table

   country               year          cases          population       
 Length:6           Min.   :1999   Min.   :   745   Min.   :1.999e+07  
 Class :character   1st Qu.:1999   1st Qu.: 11434   1st Qu.:5.845e+07  
 Mode  :character   Median :2000   Median : 59112   Median :1.733e+08  
                    Mean   :2000   Mean   : 91277   Mean   :4.901e+08  
                    3rd Qu.:2000   3rd Qu.:179316   3rd Qu.:9.983e+08  
                    Max.   :2000   Max.   :213766   Max.   :1.280e+09

you can also summarize one variable, e.g. summary(table1$cases)

hist(): takes a vector and plots a histogram.

hist(table1$cases)

table(): uses classifying factors to build a contingency table of the counts for each factor combination.

table(table2$year) # tally of one column


1999 2000 
   6    6

table(table2$year, table2$type) # two columns at a time

      
       cases population
  1999     3          3
  2000     3          3

Main verbs in tidyr

Tidying data

The four verbs (functions) to keep in mind for reshaping data with tidyr are:

pivot_wider()
pivot_longer()
separate()
unite()

There are other verbs as well - look at the tidyr cheatsheet!

pivot function - visualized

::: aside tidyexplain from @gadenbuie :::

pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.

Is the left tibble tidy?

`pivot_wider()` - Exercise 6

What will the output look like?

table2

# A tibble: 12 × 4
   country      year type            count
   <chr>       <dbl> <chr>           <dbl>
 1 Afghanistan  1999 cases             745
 2 Afghanistan  1999 population   19987071
 3 Afghanistan  2000 cases            2666
 4 Afghanistan  2000 population   20595360
 5 Brazil       1999 cases           37737
 6 Brazil       1999 population  172006362
 7 Brazil       2000 cases           80488
 8 Brazil       2000 population  174504898
 9 China        1999 cases          212258
10 China        1999 population 1272915272
11 China        2000 cases          213766
12 China        2000 population 1280428583

pivot_wider(
  table2,
  names_from = type,
  values_from = count
)

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

If you want to save the output, assign it to a new variable. This new variable will appear in your Environment tab.

table2_tidy <- pivot_wider(
  table2,
  names_from = type,
  values_from = count
)

table2_tidy

# A tibble: 6 × 4
  country      year  cases population
  <chr>       <dbl>  <dbl>      <dbl>
1 Afghanistan  1999    745   19987071
2 Afghanistan  2000   2666   20595360
3 Brazil       1999  37737  172006362
4 Brazil       2000  80488  174504898
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.

Is the left tibble tidy?

`pivot_longer()` - Exercise 7

What will the output look like?

table4a

# A tibble: 3 × 3
  country     `1999` `2000`
  <chr>        <dbl>  <dbl>
1 Afghanistan    745   2666
2 Brazil       37737  80488
3 China       212258 213766

pivot_longer(
  table4a,
  2:3,
  names_to = "year",
  values_to = "cases"
)

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

pivot_longer(
  table4a,
  -country,
  names_to = "year",
  values_to = "cases"
)

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

table4_tidy <- pivot_longer(
  table4a,
  -country,
  names_to = "year",
  values_to = "cases"
)

table4_tidy

# A tibble: 6 × 3
  country     year   cases
  <chr>       <chr>  <dbl>
1 Afghanistan 1999     745
2 Afghanistan 2000    2666
3 Brazil      1999   37737
4 Brazil      2000   80488
5 China       1999  212258
6 China       2000  213766

separate - syntax

Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.

`separate()` syntax

separate(
  data,
  col,
  into,
  sep = "[^[:alnum:]]+", #any sequence of non-alphanumeric values
  remove = TRUE, # default is to remove the original column
  convert = FALSE, # default is to not convert
  extra = "warn",
  fill = "warn",
  ...
)

separate - Exercise 8

What will the output look like?

table3

# A tibble: 6 × 3
  country      year rate             
  <chr>       <dbl> <chr>            
1 Afghanistan  1999 745/19987071     
2 Afghanistan  2000 2666/20595360    
3 Brazil       1999 37737/172006362  
4 Brazil       2000 80488/174504898  
5 China        1999 212258/1272915272
6 China        2000 213766/1280428583

separate(
  table3,
  rate,
  into = c("cases", "population"),
  sep = "/"
)

# A tibble: 6 × 4
  country      year cases  population
  <chr>       <dbl> <chr>  <chr>     
1 Afghanistan  1999 745    19987071  
2 Afghanistan  2000 2666   20595360  
3 Brazil       1999 37737  172006362 
4 Brazil       2000 80488  174504898 
5 China        1999 212258 1272915272
6 China        2000 213766 1280428583

`unite()` - syntax

unite() combines multiple columns into a single column.

`unite()` syntax

unite(
  data, 
  col, 
  ..., # select columns to unite
  sep = "[^[:alnum:]]+", #any sequence of non-alphanumeric values
  remove = TRUE, #default is set to TRUE
  na.rm = FALSE #default is set to FALSE
)

unite - Exercise 9

table6 <- read_csv(
  here("data/bootcamp/table6.csv"),
  show_col_types = FALSE
)

table6

# A tibble: 3 × 5
   ...1 country    century  year  cases
  <dbl> <chr>        <dbl> <dbl>  <dbl>
1     1 Afganistan      19    99    745
2     2 Brazil          19    99  37737
3     3 China           19    99 212258

unite(
  table6,
  "new",
  century,
  year,
  sep = ""
)

# A tibble: 3 × 4
   ...1 country    new    cases
  <dbl> <chr>      <chr>  <dbl>
1     1 Afganistan 1999     745
2     2 Brazil     1999   37737
3     3 China      1999  212258

Handling missing values

df <- tibble(
  x = c(1, 2, NA),
  y = c("a", NA, "b")
)

df

# A tibble: 3 × 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 <NA> 
3    NA b

drop_na(df)

# A tibble: 1 × 2
      x y    
  <dbl> <chr>
1     1 a

replace_na(
  df,
  list(x = 0, y = "unknown")
)

# A tibble: 3 × 2
      x y      
  <dbl> <chr>  
1     1 a      
2     2 unknown
3     0 b

R Bootcamp - Day 2

Class 2 outline

Tidyverse

Tidyverse overview

Tidyverse workflow

What is tidy data?

Tidying exercises

Data import with readr

finding files in a project with here

Paths identified by here() can be fed to e.g. read_* functions.

Data sets for tidying - Exercise 2

Getting familiar with the data - Exercise 3

Getting familiar with the data - Exercise 4

Exercise 5 - summary, hist, & table

Main verbs in tidyr

Tidying data

pivot function - visualized

pivot_wider() - Exercise 6

pivot_longer() - Exercise 7

separate - syntax

separate() syntax

separate - Exercise 8

unite() - syntax

unite() syntax

unite - Exercise 9

Handling missing values

Handling missing values

finding files in a project with `here`

Paths identified by `here()` can be fed to e.g. `read_*` functions.

`pivot_wider()` - Exercise 6

`pivot_longer()` - Exercise 7

`separate()` syntax

`unite()` - syntax

`unite()` syntax