R Bootcamp - Day 3

dplyr

Jay Hesselberth

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Class 3 outline

  • Introduce dplyr & today’s datasets (Exercise 1)
  • Review basic functions of dplyr
    • core dplyr verbs:
    • arrange (Exercise 2)
    • filter (Exercise 3)
    • select (Exercise 4)
    • mutate and the pipe (Exercise 5)
    • summarise (Exercise 6)
    • modify scope of verbs using: group_by (Exercise 7)
    • and many others! rename, count, add_row, add_column, distinct, sample_n, sample_frac, slice, pull (Exercise 8)

dplyr overview

dplyr:

  • provides a set of tools for efficiently manipulating data sets in R.
  • is extremely fast even with large data sets.
  • follows the tidyverse grammar and philosophy; human-readable and intuitive
  • encourages linking of verbs together using pipes |> (or the older %>%)

Today’s datasets

  • We will use a data set that comes with the dplyr package to explore its functions.

  • dplyr::starwars contains data for characters from Star Wars.

starwars
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Explore starwars in the console with head(), View(), and summary().

dplyr package

dplyr is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges:

  • arrange() changes the ordering of the rows.
  • filter() picks cases based on their values.
  • select() picks variables based on their names.
  • mutate() adds new variables that are functions of existing variables
  • summarise() reduces multiple values down to a single summary.
  • These all combine naturally with group_by() which allows you to perform any operation “by group”.

  • Pipes |> allows different functions to be used together to create a workflow. x |> f(y) turns into f(x, y)

arrange - Syntax

  • arrange() orders rows by values of one or more columns (low to high).
  • The desc() helper orders high to low.
arrange(data = ..., <colname>)

arrange - Exercise 2

# default is to arrange in ascending order
arrange(starwars, height)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

arrange - Exercise 2

# arrange in descending order
arrange(starwars, desc(height))
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yarael …    264    NA none       white      yellow          NA   male  mascu…
 2 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
 3 Lama Su     229    88 none       grey       black           NA   male  mascu…
 4 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 5 Roos Ta…    224    82 none       grey       orange          NA   male  mascu…
 6 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
 7 Taun We     213    NA none       grey       black           NA   fema… femin…
 8 Rugor N…    206    NA none       green      orange          NA   male  mascu…
 9 Tion Me…    206    80 none       grey       black           NA   male  mascu…
10 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

arrange - Exercise 2

# arrange by multiple columns
arrange(starwars, height, mass)
# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Yoda         66    17 white      green      brown            896 male  mascu…
 2 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 3 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 4 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 5 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 6 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
 7 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 8 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 Watto       137    NA black      blue, grey yellow            NA male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter - Syntax

  • filter() chooses rows/cases where conditions are true.
filter(data = ..., <condition>)

filter - Exercise 3

filter(starwars, skin_color == "light")
# A tibble: 11 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Leia Or…    150    49 brown      light      brown             19 fema… femin…
 2 Owen La…    178   120 brown, gr… light      blue              52 male  mascu…
 3 Beru Wh…    165    75 brown      light      blue              47 fema… femin…
 4 Biggs D…    183    84 black      light      brown             24 male  mascu…
 5 Lobot       175    79 none       light      blue              37 male  mascu…
 6 Padmé A…    185    45 brown      light      brown             46 fema… femin…
 7 Cordé       157    NA brown      light      brown             NA <NA>  <NA>  
 8 Dormé       165    NA brown      light      brown             NA fema… femin…
 9 Raymus …    188    79 brown      light      brown             NA male  mascu…
10 Rey          NA    NA brown      light      hazel             NA fema… femin…
11 Poe Dam…     NA    NA brown      light      brown             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter - Exercise 3

filter(starwars, height < 150)
# A tibble: 10 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 R2-D2        96    32 <NA>       white, bl… red               33 none  mascu…
 2 R5-D4        97    32 <NA>       white, red red               NA none  mascu…
 3 Yoda         66    17 white      green      brown            896 male  mascu…
 4 Wicket …     88    20 brown      brown      brown              8 male  mascu…
 5 Watto       137    NA black      blue, grey yellow            NA male  mascu…
 6 Sebulba     112    40 none       grey, red  orange            NA male  mascu…
 7 Ratts T…     79    15 none       grey, blue unknown           NA male  mascu…
 8 Dud Bolt     94    45 none       blue, grey yellow            NA male  mascu…
 9 Gasgano     122    NA none       white, bl… black             NA male  mascu…
10 R4-P17       96    NA none       silver, r… red, blue         NA none  femin…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter - Exercise 3

filter(
  starwars,
  mass > mean(mass, na.rm = TRUE)
)
# A tibble: 10 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 2 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 3 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
 4 Jabba D…    175  1358 <NA>       green-tan… orange         600   herm… mascu…
 5 Jek Ton…    180   110 brown      fair       blue            NA   <NA>  <NA>  
 6 IG-88       200   140 none       metal      red             15   none  mascu…
 7 Bossk       190   113 none       green      red             53   male  mascu…
 8 Dexter …    198   102 none       brown      yellow          NA   male  mascu…
 9 Grievous    216   159 none       brown, wh… green, y…       NA   male  mascu…
10 Tarfful     234   136 brown      brown      blue            NA   male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter - Exercise 3

Filter out cases where hair_color is NA

filter(starwars, is.na(hair_color))
# A tibble: 5 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 C-3PO        167    75 <NA>       gold       yellow           112 none  mascu…
2 R2-D2         96    32 <NA>       white, bl… red               33 none  mascu…
3 R5-D4         97    32 <NA>       white, red red               NA none  mascu…
4 Greedo       173    74 <NA>       green      black             44 male  mascu…
5 Jabba De…    175  1358 <NA>       green-tan… orange           600 herm… mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

filter - Exercise 3

  • The most frequently used comparison operators are:

  • >, <, >=, <=, == (equal), != (not equal)

  • is.na(), !is.na(), and %in% (contained in a vector of cases).

filter(
  starwars,
  skin_color %in% c("light", "fair", "pale")
)
# A tibble: 33 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 Biggs D…    183    84 black      light      brown           24   male  mascu…
 6 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 7 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 8 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 9 Han Solo    180    80 brown      fair       brown           29   male  mascu…
10 Wedge A…    170    77 brown      fair       hazel           21   male  mascu…
# ℹ 23 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
# can also store as a named vector and use %in% with the vector
color <- c("light", "fair", "pale")
filter(starwars, skin_color %in% color)
# A tibble: 33 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 3 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 4 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 5 Biggs D…    183    84 black      light      brown           24   male  mascu…
 6 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 7 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 8 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
 9 Han Solo    180    80 brown      fair       brown           29   male  mascu…
10 Wedge A…    170    77 brown      fair       hazel           21   male  mascu…
# ℹ 23 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

Conditions can be combined using & (and), | (or).

filter(
  starwars,
  skin_color == "light" | eye_color == "brown"
)
# A tibble: 25 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Leia Or…    150  49   brown      light      brown           19   fema… femin…
 2 Owen La…    178 120   brown, gr… light      blue            52   male  mascu…
 3 Beru Wh…    165  75   brown      light      blue            47   fema… femin…
 4 Biggs D…    183  84   black      light      brown           24   male  mascu…
 5 Han Solo    180  80   brown      fair       brown           29   male  mascu…
 6 Yoda         66  17   white      green      brown          896   male  mascu…
 7 Boba Fe…    183  78.2 black      fair       brown           31.5 male  mascu…
 8 Lando C…    177  79   black      dark       brown           31   male  mascu…
 9 Lobot       175  79   none       light      blue            37   male  mascu…
10 Arvel C…     NA  NA   brown      fair       brown           NA   male  mascu…
# ℹ 15 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>
filter(
  starwars,
  skin_color == "light" & eye_color == "brown"
)
# A tibble: 7 × 14
  name      height  mass hair_color skin_color eye_color birth_year sex   gender
  <chr>      <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
1 Leia Org…    150    49 brown      light      brown             19 fema… femin…
2 Biggs Da…    183    84 black      light      brown             24 male  mascu…
3 Padmé Am…    185    45 brown      light      brown             46 fema… femin…
4 Cordé        157    NA brown      light      brown             NA <NA>  <NA>  
5 Dormé        165    NA brown      light      brown             NA fema… femin…
6 Raymus A…    188    79 brown      light      brown             NA male  mascu…
7 Poe Dame…     NA    NA brown      light      brown             NA male  mascu…
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

select - Syntax

  • select extracts one or more columns from a table
select(data = ..., <colname>)  

select - Exercise 4

# select *only* the variable `hair_color`
select(starwars, hair_color)
# A tibble: 87 × 1
   hair_color   
   <chr>        
 1 blond        
 2 <NA>         
 3 <NA>         
 4 none         
 5 brown        
 6 brown, grey  
 7 brown        
 8 <NA>         
 9 black        
10 auburn, white
# ℹ 77 more rows

select - Exercise 4

# drop the variable `hair_color`
select(starwars, -hair_color)
# A tibble: 87 × 13
   name      height  mass skin_color eye_color birth_year sex   gender homeworld
   <chr>      <int> <dbl> <chr>      <chr>          <dbl> <chr> <chr>  <chr>    
 1 Luke Sky…    172    77 fair       blue            19   male  mascu… Tatooine 
 2 C-3PO        167    75 gold       yellow         112   none  mascu… Tatooine 
 3 R2-D2         96    32 white, bl… red             33   none  mascu… Naboo    
 4 Darth Va…    202   136 white      yellow          41.9 male  mascu… Tatooine 
 5 Leia Org…    150    49 light      brown           19   fema… femin… Alderaan 
 6 Owen Lars    178   120 light      blue            52   male  mascu… Tatooine 
 7 Beru Whi…    165    75 light      blue            47   fema… femin… Tatooine 
 8 R5-D4         97    32 white, red red             NA   none  mascu… Tatooine 
 9 Biggs Da…    183    84 light      brown           24   male  mascu… Tatooine 
10 Obi-Wan …    182    77 fair       blue-gray       57   male  mascu… Stewjon  
# ℹ 77 more rows
# ℹ 4 more variables: species <chr>, films <list>, vehicles <list>,
#   starships <list>

select - Exercise 4

select(starwars, hair_color, skin_color, eye_color)
# A tibble: 87 × 3
   hair_color    skin_color  eye_color
   <chr>         <chr>       <chr>    
 1 blond         fair        blue     
 2 <NA>          gold        yellow   
 3 <NA>          white, blue red      
 4 none          white       yellow   
 5 brown         light       brown    
 6 brown, grey   light       blue     
 7 brown         light       blue     
 8 <NA>          white, red  red      
 9 black         light       brown    
10 auburn, white fair        blue-gray
# ℹ 77 more rows

select - Exercise 4

# select variables `hair_color` through `eye_color`
select(starwars, hair_color:eye_color)
# A tibble: 87 × 3
   hair_color    skin_color  eye_color
   <chr>         <chr>       <chr>    
 1 blond         fair        blue     
 2 <NA>          gold        yellow   
 3 <NA>          white, blue red      
 4 none          white       yellow   
 5 brown         light       brown    
 6 brown, grey   light       blue     
 7 brown         light       blue     
 8 <NA>          white, red  red      
 9 black         light       brown    
10 auburn, white fair        blue-gray
# ℹ 77 more rows

select - Exercise 4

# drop variables `hair_color` through `eye_color`
select(starwars, !(hair_color:eye_color))
# A tibble: 87 × 11
   name    height  mass birth_year sex   gender homeworld species films vehicles
   <chr>    <int> <dbl>      <dbl> <chr> <chr>  <chr>     <chr>   <lis> <list>  
 1 Luke S…    172    77       19   male  mascu… Tatooine  Human   <chr> <chr>   
 2 C-3PO      167    75      112   none  mascu… Tatooine  Droid   <chr> <chr>   
 3 R2-D2       96    32       33   none  mascu… Naboo     Droid   <chr> <chr>   
 4 Darth …    202   136       41.9 male  mascu… Tatooine  Human   <chr> <chr>   
 5 Leia O…    150    49       19   fema… femin… Alderaan  Human   <chr> <chr>   
 6 Owen L…    178   120       52   male  mascu… Tatooine  Human   <chr> <chr>   
 7 Beru W…    165    75       47   fema… femin… Tatooine  Human   <chr> <chr>   
 8 R5-D4       97    32       NA   none  mascu… Tatooine  Droid   <chr> <chr>   
 9 Biggs …    183    84       24   male  mascu… Tatooine  Human   <chr> <chr>   
10 Obi-Wa…    182    77       57   male  mascu… Stewjon   Human   <chr> <chr>   
# ℹ 77 more rows
# ℹ 1 more variable: starships <list>

select - Exercise 4

# `starts_with`, `ends_with`, `contains`
select(starwars, ends_with("color"))
# A tibble: 87 × 3
   hair_color    skin_color  eye_color
   <chr>         <chr>       <chr>    
 1 blond         fair        blue     
 2 <NA>          gold        yellow   
 3 <NA>          white, blue red      
 4 none          white       yellow   
 5 brown         light       brown    
 6 brown, grey   light       blue     
 7 brown         light       blue     
 8 <NA>          white, red  red      
 9 black         light       brown    
10 auburn, white fair        blue-gray
# ℹ 77 more rows

mutate - Syntax

  • mutate() to compute new columns
mutate(data = ..., <newcolname> =  funs(<oldcolname>))
mutate(data = ..., <colname>, funs(x))

or with the the pipe |>

Useful when multiple functions act sequentially on a dataframe.

data |>
  mutate(<colname>, funs(x)) 

mutate (& pipe |>)- Exercise 5

# create a new column to display height in meters
mutate(starwars, height_m = height / 100)
# A tibble: 87 × 15
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 6 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>, height_m <dbl>

mutate (& pipe |>)- Exercise 5

# using the pipe to feed data into multiple functions sequentially
starwars |>
  mutate(height_m = height / 100) |>
  select(name, height_m, height, everything())
# A tibble: 87 × 15
   name   height_m height  mass hair_color skin_color eye_color birth_year sex  
   <chr>     <dbl>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke …     1.72    172    77 blond      fair       blue            19   male 
 2 C-3PO      1.67    167    75 <NA>       gold       yellow         112   none 
 3 R2-D2      0.96     96    32 <NA>       white, bl… red             33   none 
 4 Darth…     2.02    202   136 none       white      yellow          41.9 male 
 5 Leia …     1.5     150    49 brown      light      brown           19   fema…
 6 Owen …     1.78    178   120 brown, gr… light      blue            52   male 
 7 Beru …     1.65    165    75 brown      light      blue            47   fema…
 8 R5-D4      0.97     97    32 <NA>       white, red red             NA   none 
 9 Biggs…     1.83    183    84 black      light      brown           24   male 
10 Obi-W…     1.82    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 6 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

mutate (& pipe |>)- Exercise 5

Mutate allows you to refer to columns that you’ve just created

starwars |>
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) |>
  select(name, BMI, everything())
# A tibble: 87 × 16
   name        BMI height  mass hair_color skin_color eye_color birth_year sex  
   <chr>     <dbl>  <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr>
 1 Luke Sky…  26.0    172    77 blond      fair       blue            19   male 
 2 C-3PO      26.9    167    75 <NA>       gold       yellow         112   none 
 3 R2-D2      34.7     96    32 <NA>       white, bl… red             33   none 
 4 Darth Va…  33.3    202   136 none       white      yellow          41.9 male 
 5 Leia Org…  21.8    150    49 brown      light      brown           19   fema…
 6 Owen Lars  37.9    178   120 brown, gr… light      blue            52   male 
 7 Beru Whi…  27.5    165    75 brown      light      blue            47   fema…
 8 R5-D4      34.0     97    32 <NA>       white, red red             NA   none 
 9 Biggs Da…  25.1    183    84 black      light      brown           24   male 
10 Obi-Wan …  23.2    182    77 auburn, w… fair       blue-gray       57   male 
# ℹ 77 more rows
# ℹ 7 more variables: gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>, height_m <dbl>

Output needs to be saved into a new data frame since dplyr does not “change” the original dataframe.

starwars_bmi <- starwars |>
  mutate(
    height_m = height / 100,
    BMI = mass / (height_m^2)
  ) |>
  select(name, BMI, everything())

Using case_when()clauses with mutate().

Let’s create a new variable tall_short based on other values.

starwars |>
  mutate(
    tall_short = case_when(
      height > 160 ~ "tall",
      .default = "short"
    )
  ) |>
  select(name, tall_short, everything())
# A tibble: 87 × 15
   name       tall_short height  mass hair_color skin_color eye_color birth_year
   <chr>      <chr>       <int> <dbl> <chr>      <chr>      <chr>          <dbl>
 1 Luke Skyw… tall          172    77 blond      fair       blue            19  
 2 C-3PO      tall          167    75 <NA>       gold       yellow         112  
 3 R2-D2      short          96    32 <NA>       white, bl… red             33  
 4 Darth Vad… tall          202   136 none       white      yellow          41.9
 5 Leia Orga… short         150    49 brown      light      brown           19  
 6 Owen Lars  tall          178   120 brown, gr… light      blue            52  
 7 Beru Whit… tall          165    75 brown      light      blue            47  
 8 R5-D4      short          97    32 <NA>       white, red red             NA  
 9 Biggs Dar… tall          183    84 black      light      brown           24  
10 Obi-Wan K… tall          182    77 auburn, w… fair       blue-gray       57  
# ℹ 77 more rows
# ℹ 7 more variables: sex <chr>, gender <chr>, homeworld <chr>, species <chr>,
#   films <list>, vehicles <list>, starships <list>

group_by() & summarise() - Exercise 6

group_by creates a grouped copy of a table.

  • This changes the unit of analysis from the complete data set to individual groups.
  • dplyr verbs automatically detect grouped tables and calculate “by group”.
group_by(data = ..., <colname>)

group_by - Syntax

  • group_by() creates a grouped tibble.
  • This changes the unit of analysis from the complete dataset to individual groups.
  • Then, when you use the dplyr verbs on a grouped data frame they’ll be automatically applied “by group”.
group_by(data = ..., <colname>)

group_by + summarize - Exercise 7

starwars |>
  group_by(species)
# A tibble: 87 × 14
# Groups:   species [38]
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

summarize - syntax

  • summarize() takes named expressions and calculates a summary based on group.
summarize(data = ..., name = expression)

Calculate a summary statistic by species

starwars |>
  group_by(species) |>
  summarise(
    height = mean(height, na.rm = TRUE)
  )
# A tibble: 38 × 2
   species   height
   <chr>      <dbl>
 1 Aleena       79 
 2 Besalisk    198 
 3 Cerean      198 
 4 Chagrian    196 
 5 Clawdite    168 
 6 Droid       131.
 7 Dug         112 
 8 Ewok         88 
 9 Geonosian   183 
10 Gungan      209.
# ℹ 28 more rows

Calucate multiple summary statistics.

starwars |>
  group_by(species, gender) |>
  summarise(
    height = mean(height, na.rm = TRUE),
    mass = mean(mass, na.rm = TRUE)
  )
`summarise()` has grouped output by 'species'. You can override using the
`.groups` argument.
# A tibble: 42 × 4
# Groups:   species [38]
   species   gender    height  mass
   <chr>     <chr>      <dbl> <dbl>
 1 Aleena    masculine     79  15  
 2 Besalisk  masculine    198 102  
 3 Cerean    masculine    198  82  
 4 Chagrian  masculine    196 NaN  
 5 Clawdite  feminine     168  55  
 6 Droid     feminine      96 NaN  
 7 Droid     masculine    140  69.8
 8 Dug       masculine    112  40  
 9 Ewok      masculine     88  20  
10 Geonosian masculine    183  80  
# ℹ 32 more rows