Tidy data & tidyr
RNA Bioscience Initiative | CU Anschutz
2024-10-21
“Tidy datasets are all alike but every messy dataset is messy in its own way.”
— Hadley Wickham
N.B. Note the difference between readr::read_csv() and the built-in read.csv(). Use the former.
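As a rough illustration of why the two readers differ, the sketch below parses the same text with both. The inline CSV text and its column names are made up for this example, and passing literal data with I() assumes readr 2.0 or later.

```r
library(readr)

# Hypothetical inline data; the second column name contains a space
csv_text <- "id,sample name\n1,control\n2,treated\n"

# readr::read_csv(): I() marks the string as literal data, not a file path
tidy_df <- read_csv(I(csv_text), show_col_types = FALSE)

# Built-in read.csv(): reads the same text through the `text` argument
base_df <- read.csv(text = csv_text)

class(tidy_df)  # includes "tbl_df", i.e. a tibble
class(base_df)  # a plain data.frame

names(tidy_df)  # column names kept as written
names(base_df)  # check.names = TRUE rewrites "sample name" to "sample.name"
```

Two visible differences here are the return type (tibble vs. data frame) and the handling of non-syntactic column names.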
here
The here package makes it easy to reference files in an R project. The path returned by here() can be passed to e.g. the read_* functions.
# A tibble: 35 × 6
date block topic instructor location title
<dttm> <chr> <chr> <chr> <chr> <chr>
1 2024-08-26 00:00:00 Bootcamp R Taliaferro Ed2N 2106 Intro to R & …
2 2024-08-27 00:00:00 Bootcamp R Taliaferro Ed1 1400 Tidy data & t…
3 2024-08-28 00:00:00 Bootcamp R Taliaferro Ed2N 2102 dplyr
4 2024-08-29 00:00:00 Bootcamp R Taliaferro AHSB 2200 ggplot2
5 2024-08-30 00:00:00 Bootcamp R Taliaferro Ed1 1400 ggplot2
6 2024-09-02 00:00:00 <NA> <NA> <NA> <NA> NO CLASS: LAB…
7 2024-09-03 00:00:00 Bootcamp R Taliaferro Ed1 1400 tidyverse odd…
8 2024-09-04 00:00:00 Bootcamp R Taliaferro Ed1 1400 putting it al…
9 2024-09-05 00:00:00 Bootcamp R Taliaferro Ed1 1400 putting it al…
10 2024-09-06 00:00:00 Bootcamp Statistics Ramachandran AHSB 2200 Stats intro a…
# ℹ 25 more rows
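A minimal sketch of how here() and a read_* function might be combined to produce a tibble like the one above. The location data/syllabus.tsv inside the project and the choice of readr::read_tsv() are assumptions; the original call is not shown.

```r
library(here)

# here() builds a path relative to the project root
# rather than the current working directory
syllabus_path <- here("data", "syllabus.tsv")
syllabus_path

# Read the file only if it exists at the assumed location
if (file.exists(syllabus_path)) {
  syllabus <- readr::read_tsv(syllabus_path, show_col_types = FALSE)
  syllabus
}
```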
In this class, we will use the data sets that come with the tidyr package to explore all the functions provided by tidyr.
table1, table2, table3, table4a, table4b, and table5 all display the number of TB cases documented by the World Health Organization in Afghanistan, Brazil, and China in 1999 and 2000.
Let’s take a look at these data sets.
R provides many functions to examine features of data:
View(): opens the table in an Excel-like interface (not recommended for large tables)
class(): what kind of object is it (high-level)?
typeof(): what is the object's data type (low-level)?
is_tibble(): use is_* functions to query the data type
str(): what is the structure of the object?
attributes(): does it have any metadata?
Let’s use these to explore table1
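One possible way to run these inspection functions on table1 is sketched below. It assumes the tidyr and tibble packages are installed; View() is left as a comment because it only works in an interactive session.

```r
library(tidyr)   # provides table1
library(tibble)  # provides is_tibble()

class(table1)       # high-level class: a tibble is also a data.frame
typeof(table1)      # low-level type: data frames are stored as lists
is_tibble(table1)   # TRUE for a tibble
str(table1)         # compact structure, one line per column
attributes(table1)  # metadata such as names, row.names, and class

# View(table1)  # spreadsheet-style viewer; interactive sessions only
```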
Useful functions for tibbles:
head() - shows first 6 rows
tail() - shows last 6 rows
dim() - returns the dimensions of the data frame (i.e. number of rows and number of columns)
nrow() - number of rows
ncol() - number of columns
names() or colnames() - both show the names attribute for a data frame
glimpse() - shows a transposed preview, with one line per column
Use these on table1
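A short sketch of these functions applied to table1. It assumes dplyr is loaded for glimpse(); the `n = 3` argument is only there to show that the default of 6 rows can be changed.

```r
library(tidyr)  # provides table1
library(dplyr)  # provides glimpse()

head(table1, n = 3)  # first 3 rows (the default is n = 6)
tail(table1, n = 3)  # last 3 rows
dim(table1)          # number of rows, then number of columns
nrow(table1)
ncol(table1)
names(table1)        # same result as colnames(table1)
glimpse(table1)      # one line per column, with type and first values
```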
summary(): a generic function that produces summaries of many kinds of objects, including the results of model-fitting functions. For a data frame, it summarizes each column.
   country               year          cases          population
 Length:6           Min.   :1999   Min.   :   745   Min.   :1.999e+07
 Class :character   1st Qu.:1999   1st Qu.: 11434   1st Qu.:5.845e+07
 Mode  :character   Median :2000   Median : 59112   Median :1.733e+08
                    Mean   :2000   Mean   : 91277   Mean   :4.901e+08
                    3rd Qu.:2000   3rd Qu.:179316   3rd Qu.:9.983e+08
                    Max.   :2000   Max.   :213766   Max.   :1.280e+09
You can also summarize one variable, e.g. summary(table1$cases)
hist(): takes a vector and plots a histogram.
table(): uses classifying factors to build a contingency table of the counts for each factor combination.
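The three functions above could be tried on table1 as follows. The variable name `country_counts` is introduced only for this sketch.

```r
library(tidyr)  # provides table1

summary(table1)        # per-column summary of the whole tibble
summary(table1$cases)  # summary of a single numeric column

hist(table1$cases)     # draws a histogram on the active graphics device

# Count how many rows belong to each country
country_counts <- table(table1$country)
country_counts
```

Since table1 has one row per country per year, each country should appear twice in the contingency table.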
The four verbs (functions) to keep in mind for reshaping data with tidyr are:
pivot_wider()
pivot_longer()
separate()
unite()
There are other verbs as well; look at the tidyr cheatsheet!
tidyexplain from @gadenbuie
pivot_wider() “widens” data, increasing the number of columns and decreasing the number of rows.
Is the left tibble tidy?
pivot_wider() - Exercise 6
What will the output look like?
# A tibble: 12 × 4
country year type count
<chr> <dbl> <chr> <dbl>
1 Afghanistan 1999 cases 745
2 Afghanistan 1999 population 19987071
3 Afghanistan 2000 cases 2666
4 Afghanistan 2000 population 20595360
5 Brazil 1999 cases 37737
6 Brazil 1999 population 172006362
7 Brazil 2000 cases 80488
8 Brazil 2000 population 174504898
9 China 1999 cases 212258
10 China 1999 population 1272915272
11 China 2000 cases 213766
12 China 2000 population 1280428583
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
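The call that turns the 12-row tibble into the 6-row one is not shown in the text above. A sketch that should reproduce it, assuming the input is tidyr's built-in table2:

```r
library(tidyr)  # provides table2 and pivot_wider()

# names_from: the column whose values become the new column names
# values_from: the column whose values fill those new columns
pivot_wider(table2, names_from = type, values_from = count)
```

Each distinct value of `type` (cases, population) becomes its own column, so the two rows per country and year collapse into one.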
If you want to save the output, assign it to a new variable. This new variable will appear in your Environment tab.
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <dbl> <dbl>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
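Saving a result might look like the sketch below. The name `table2_tidy` is hypothetical, and the pivot_wider() call on table2 is an assumption about which output is being saved.

```r
library(tidyr)

# Assignment stores the result and prints nothing
table2_tidy <- pivot_wider(table2, names_from = type, values_from = count)

# Evaluating the name prints the saved tibble
table2_tidy
```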
pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.
Is the left tibble tidy?
pivot_longer() - Exercise 7
What will the output look like?
# A tibble: 3 × 3
country `1999` `2000`
<chr> <dbl> <dbl>
1 Afghanistan 745 2666
2 Brazil 37737 80488
3 China 212258 213766
# A tibble: 6 × 3
country year cases
<chr> <chr> <dbl>
1 Afghanistan 1999 745
2 Afghanistan 2000 2666
3 Brazil 1999 37737
4 Brazil 2000 80488
5 China 1999 212258
6 China 2000 213766
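The call for this exercise is not shown above. A sketch that should give the 6-row result, assuming the input is tidyr's built-in table4a:

```r
library(tidyr)  # provides table4a and pivot_longer()

pivot_longer(
  table4a,
  cols = c(`1999`, `2000`),  # non-syntactic names need backticks
  names_to = "year",         # old column names go into this column
  values_to = "cases"        # old cell values go into this column
)
```

The new year column is character because it is built from column names. If a numeric year is wanted, one option is the `names_transform` argument, e.g. `names_transform = list(year = as.integer)`.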
Given either a regular expression or a vector of character positions, separate() turns a single character column into multiple columns.
separate() - syntax
What will the output look like?
# A tibble: 6 × 3
country year rate
<chr> <dbl> <chr>
1 Afghanistan 1999 745/19987071
2 Afghanistan 2000 2666/20595360
3 Brazil 1999 37737/172006362
4 Brazil 2000 80488/174504898
5 China 1999 212258/1272915272
6 China 2000 213766/1280428583
# A tibble: 6 × 4
country year cases population
<chr> <dbl> <chr> <chr>
1 Afghanistan 1999 745 19987071
2 Afghanistan 2000 2666 20595360
3 Brazil 1999 37737 172006362
4 Brazil 2000 80488 174504898
5 China 1999 212258 1272915272
6 China 2000 213766 1280428583
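The separate() call itself is missing from the text above. A sketch that should reproduce the second tibble, assuming the input is tidyr's built-in table3:

```r
library(tidyr)  # provides table3 and separate()

# Split `rate` at the "/" into two new columns
separate(table3, rate, into = c("cases", "population"), sep = "/")

# The new columns are character by default;
# convert = TRUE asks separate() to guess better types
separate(table3, rate, into = c("cases", "population"), sep = "/",
         convert = TRUE)
```

In recent tidyr releases separate() is marked as superseded in favor of functions such as separate_wider_delim(), but it continues to work.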
unite() combines multiple columns into a single column.
unite() - syntax
# A tibble: 3 × 5
...1 country century year cases
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 Afganistan 19 99 745
2 2 Brazil 19 99 37737
3 3 China 19 99 212258
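The unite() call and its output are not shown above. As a stand-in, this sketch uses tidyr's built-in table5, which also stores century and year in separate columns; it is not the tibble printed above.

```r
library(tidyr)  # provides table5 and unite()

# Paste `century` and `year` together into one column named "year";
# sep = "" joins them with no separator (the default is "_")
unite(table5, "year", century, year, sep = "")
```

By default the input columns are dropped and the combined column takes the place of the first one.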
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 <NA>
3 NA b
# A tibble: 3 × 2
x y
<dbl> <chr>
1 1 a
2 2 unknown
3 0 b
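The two small tibbles above look like a before-and-after pair for filling in missing values, although the code that produced them is not shown. One way to get from the first to the second, assuming tidyr's replace_na() was the intended function:

```r
library(tidyr)
library(tibble)

# Rebuild the first tibble; `df` is a name chosen for this sketch
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))

# Supply one replacement value per column, as a named list
replace_na(df, list(x = 0, y = "unknown"))
```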
Course website: https://rnabioco.github.io/molb-7950