R Bootcamp - Day 7

tidyverse odds & ends

Jay Hesselberth

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Class 7 outline

  • Accessing data in vectors (Exercise)
  • other tidyverse packages (stringr & forcats)
  • dplyr table joins (Exercise)
  • ggplot2 scale functions
  • ggplot2 multi-panel figures (Exercise)
  • ggplot2 saving figures

Accessing data in vectors

Using [, [[, and $

# `hp` vector from mtcars
mtcars$hp
 [1] 110 110  93 110 175 105 245  62  95 123 123 180 180 180 205 215 230  66  52
[20]  65  97 150 150 245 175  66  91 113 264 175 335 109
# columns by name
mtcars[c("hp", "mpg")]
                     hp  mpg
Mazda RX4           110 21.0
Mazda RX4 Wag       110 21.0
Datsun 710           93 22.8
Hornet 4 Drive      110 21.4
Hornet Sportabout   175 18.7
Valiant             105 18.1
Duster 360          245 14.3
Merc 240D            62 24.4
Merc 230             95 22.8
Merc 280            123 19.2
Merc 280C           123 17.8
Merc 450SE          180 16.4
Merc 450SL          180 17.3
Merc 450SLC         180 15.2
Cadillac Fleetwood  205 10.4
Lincoln Continental 215 10.4
Chrysler Imperial   230 14.7
Fiat 128             66 32.4
Honda Civic          52 30.4
Toyota Corolla       65 33.9
Toyota Corona        97 21.5
Dodge Challenger    150 15.5
AMC Javelin         150 15.2
Camaro Z28          245 13.3
Pontiac Firebird    175 19.2
Fiat X1-9            66 27.3
Porsche 914-2        91 26.0
Lotus Europa        113 30.4
Ford Pantera L      264 15.8
Ferrari Dino        175 19.7
Maserati Bora       335 15.0
Volvo 142E          109 21.4
# first 10 items in hp
hp <- mtcars$hp
hp[1:10]
 [1] 110 110  93 110 175 105 245  62  95 123
# 2 ways to get the 10th value
hp[10]
hp[[10]]
# this is an error
hp[[1:10]]
Error in hp[[1:10]]: attempt to select more than one element in vectorIndex

[ can return a range, [[ returns a single value.
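The same operators behave a little differently on a data frame. A minimal sketch using the built-in mtcars data; the object names `as_df` and `as_vec` are illustrative only and do not appear elsewhere in these slides.

```r
# `[` keeps the container: the result is a one-column data frame
as_df <- mtcars["hp"]
class(as_df)

# `[[` extracts the contents: the result is a plain numeric vector
as_vec <- mtcars[["hp"]]
class(as_vec)

# `[[` and `$` return the same vector
identical(as_vec, mtcars$hp)
```

In short, reach for `[` when you want to keep a table and for `[[` or `$` when you want the column itself.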

vector selection with logic

one-step filtering.

hp[hp > 100]
 [1] 110 110 110 175 105 245 123 123 180 180 180 205 215 230 150 150 245 175 113
[20] 264 175 335 109

two-step filtering. same result.

# get a vector of T/F values
x <- hp > 100
x
 [1]  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
[13]  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
[25]  TRUE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
# use that vector to index the original
hp[x]
 [1] 110 110 110 175 105 245 123 123 180 180 180 205 215 230 150 150 245 175 113
[20] 264 175 335 109

Logical indexing also works with is.na() to identify or exclude NA values in a vector.

Use sum() to count how many values are TRUE.

x <- hp > 100

# how many are TRUE?
sum(x)
[1] 23
length(hp[x])
[1] 23
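The is.na() pattern can be sketched the same way. The vector `y` below is made up for illustration, since `hp` has no missing values.

```r
# a small made-up vector with missing values
y <- c(4, NA, 7, NA, 10)

# logical vector: TRUE where a value is missing
is.na(y)

# count the missing values (TRUE counts as 1, FALSE as 0)
n_missing <- sum(is.na(y))

# keep only the non-missing values by negating the logical vector
y_complete <- y[!is.na(y)]
y_complete
```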

other tidyverse libraries

string operations with stringr

stringr provides several useful functions for operating on strings.

See the stringr cheatsheet

library(stringr)

str_c("letter: ", letters[1:5])
[1] "letter: a" "letter: b" "letter: c" "letter: d" "letter: e"
ids <- c("x-1", "x-2", "y-1", "y-2")
str_split(ids, "-")
[[1]]
[1] "x" "1"

[[2]]
[1] "x" "2"

[[3]]
[1] "y" "1"

[[4]]
[1] "y" "2"
# just the first parts
# str_split_i(ids, '-', 1)
# string comes first, then the pattern
str_detect(LETTERS[1:5], "A")
[1]  TRUE FALSE FALSE FALSE FALSE
str_pad(
  1:10,
  width = 2,
  side = "left",
  pad = "0"
)
 [1] "01" "02" "03" "04" "05" "06" "07" "08" "09" "10"
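As a possible follow-up to the `ids` example, here is a sketch of pulling out one piece per string and of filtering with str_detect(). It assumes stringr 1.5.0 or later, where str_split_i() is available; the names `prefix`, `suffix`, and `x_ids` are illustrative.

```r
library(stringr)

ids <- c("x-1", "x-2", "y-1", "y-2")

# str_split_i() returns the i-th piece of each string
# as a character vector instead of a list
prefix <- str_split_i(ids, "-", 1)
suffix <- str_split_i(ids, "-", 2)

# str_detect() returns one TRUE/FALSE per string;
# "^x" is a regular expression meaning "starts with x"
starts_with_x <- str_detect(ids, "^x")

# the logical vector can index the original vector
x_ids <- ids[starts_with_x]
x_ids
```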

forcats operations for factors

forcats provides several utilities for working with factors.

See the forcats cheatsheet

library(forcats)
library(palmerpenguins)
penguins[1:3]
# A tibble: 344 × 3
   species island    bill_length_mm
   <fct>   <fct>              <dbl>
 1 Adelie  Torgersen           39.1
 2 Adelie  Torgersen           39.5
 3 Adelie  Torgersen           40.3
 4 Adelie  Torgersen           NA  
 5 Adelie  Torgersen           36.7
 6 Adelie  Torgersen           39.3
 7 Adelie  Torgersen           38.9
 8 Adelie  Torgersen           39.2
 9 Adelie  Torgersen           34.1
10 Adelie  Torgersen           42  
# ℹ 334 more rows
fct_count(penguins$species)
# A tibble: 3 × 2
  f             n
  <fct>     <int>
1 Adelie      152
2 Chinstrap    68
3 Gentoo      124
fct_lump_min(
  penguins$species,
  min = 150
) |>
  fct_count()
# A tibble: 2 × 2
  f          n
  <fct>  <int>
1 Adelie   152
2 Other    192
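Two other forcats helpers that often come up are fct_infreq() and fct_rev(). The sketch below uses a small made-up factor `f` rather than the penguins data so the level order is easy to follow.

```r
library(forcats)

# a small made-up factor; levels default to alphabetical order
f <- factor(c("b", "a", "c", "b", "b", "c"))
levels(f)

# fct_infreq() reorders levels from most to least frequent
f_freq <- fct_infreq(f)
levels(f_freq)

# fct_rev() reverses whatever order the levels are in
f_rev <- fct_rev(f_freq)
levels(f_rev)
```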

Use forcats to reorder data in plots

ggplot(
  mpg,
  aes(
    x = class,
    y = hwy
  )
) +
  geom_boxplot()

ggplot(
  mpg,
  aes(
    x = fct_reorder(
      class,
      hwy,
      .fun = median
    ),
    y = hwy
  )
) +
  geom_boxplot()
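To see what fct_reorder() changes, it can help to inspect the factor levels directly rather than the plot. A minimal sketch, assuming the `mpg` data from ggplot2; `class_by_hwy` and `hwy_medians` are illustrative names.

```r
library(ggplot2) # provides the `mpg` data
library(forcats)

# default (alphabetical) level order of `class`
levels(factor(mpg$class))

# levels reordered by the median of `hwy` within each class
class_by_hwy <- fct_reorder(mpg$class, mpg$hwy, .fun = median)
levels(class_by_hwy)

# median `hwy` per class, listed in the new level order
hwy_medians <- tapply(mpg$hwy, class_by_hwy, median)
hwy_medians
```

Only the level order changes, not the data, which is why the boxes move along the x-axis while their shapes stay the same.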

dplyr

Combining tables by variables (i.e., columns)

  • bind_cols()
  • left_join()
  • right_join()
  • inner_join()
  • full_join()

Combining tables by cases (i.e., rows)

  • bind_rows()
  • intersect()
  • setdiff()
  • union()

dplyr cheatsheet

Look at “combine variables” and “combine cases” at the top.
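The row-wise and column-wise functions listed above can be sketched on two small made-up tables. The tables `a` and `b` and their columns are hypothetical and exist only for this example.

```r
library(dplyr)

# two small made-up tables with the same columns
a <- tibble(id = c(1, 2, 3), grp = c("x", "y", "z"))
b <- tibble(id = c(3, 4), grp = c("z", "w"))

# bind_rows() stacks the tables; duplicate rows are kept
stacked <- bind_rows(a, b)

# set operations compare whole rows
shared <- intersect(a, b) # rows found in both a and b
only_a <- setdiff(a, b)   # rows in a that are not in b
either <- union(a, b)     # all rows, duplicates removed

# bind_cols() pastes tables side by side by position,
# so the rows must already be in matching order
side_by_side <- bind_cols(a, tibble(score = c(10, 20, 30)))
```

Because bind_cols() matches rows by position only, a join is usually the safer choice when the tables share a key column.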

tables for joining

band_members
# A tibble: 3 × 2
  name  band   
  <chr> <chr>  
1 Mick  Stones 
2 John  Beatles
3 Paul  Beatles
band_instruments
# A tibble: 3 × 2
  name  plays 
  <chr> <chr> 
1 John  guitar
2 Paul  bass  
3 Keith guitar

mutating joins - visualized

Joining tables by a variable - Exercise 1

band_members |>
  left_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
band_members |>
  right_join(band_instruments)
# A tibble: 3 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
3 Keith <NA>    guitar
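When `by` is omitted, dplyr joins on every column name the two tables share and prints a message reporting which columns it used. A sketch of the same left join with the key spelled out; `joined` and `no_match` are illustrative names.

```r
library(dplyr)

# same left join as above, with the key column named explicitly
joined <- left_join(band_members, band_instruments, by = "name")
joined

# rows from band_members without a match get NA in the added column
no_match <- filter(joined, is.na(plays))
no_match
```

Naming the key makes the join's intent visible to readers and guards against joining on an unintended shared column.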

filtering joins - visualized

Joining tables by a variable - Exercise 2

band_members |>
  semi_join(band_instruments)
# A tibble: 2 × 2
  name  band   
  <chr> <chr>  
1 John  Beatles
2 Paul  Beatles
band_members |>
  inner_join(band_instruments)
# A tibble: 2 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 John  Beatles guitar
2 Paul  Beatles bass  
band_members |>
  full_join(
    band_instruments2,
    by = c("name" = "artist")
  )
# A tibble: 4 × 3
  name  band    plays 
  <chr> <chr>   <chr> 
1 Mick  Stones  <NA>  
2 John  Beatles guitar
3 Paul  Beatles bass  
4 Keith <NA>    guitar
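The counterpart to semi_join() is anti_join(), which keeps the rows that have no match. The sketch below also shows the join_by() form of the differently named key, assuming dplyr 1.1.0 or later; `unmatched` and `all_rows` are illustrative names.

```r
library(dplyr)

# anti_join(): rows of band_members with no match in band_instruments
unmatched <- anti_join(band_members, band_instruments, by = "name")
unmatched

# join_by() is an alternative way to match `name` to `artist`
all_rows <- full_join(
  band_members,
  band_instruments2,
  by = join_by(name == artist)
)
all_rows
```

Filtering joins never add columns, so `unmatched` has the same two columns as band_members.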

Other dplyr verbs

There are many other dplyr verbs.

  • We’ve used rename, count, add_row, add_column, distinct, sample_n, sample_frac, slice, pull

Check out the dplyr cheatsheet to learn more!
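As a quick refresher, a few of these verbs can be sketched on mtcars. The object names below are illustrative only.

```r
library(dplyr)

# rename(): new_name = old_name
mt <- rename(mtcars, horsepower = hp)

# count(): number of rows per group
cyl_counts <- count(mt, cyl)

# distinct(): unique combinations of the selected columns
cyl_gear <- distinct(mt, cyl, gear)

# slice(): select rows by position
first_three <- slice(mt, 1:3)

# pull(): extract one column as a vector (similar to `$`)
horsepower <- pull(mt, horsepower)
```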

ggplot2

scale functions in ggplot2

  • scale_color_brewer() and scale_fill_brewer() control color and fill aesthetics.
  • See available ggplot2 brewer palettes
library(cowplot) # for theme_cowplot()

p <- ggplot(
  mtcars,
  aes(
    x = mpg,
    y = hp,
    color = factor(cyl)
  )
) +
  geom_point(size = 3) +
  theme_cowplot()

p

scale functions in ggplot2

p + scale_color_brewer(palette = "Set1")

p + scale_color_brewer(palette = "Dark2")
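The examples above use the color aesthetic. For geoms that are filled, such as bars, the matching function is scale_fill_brewer(). A minimal sketch; the plot `p_fill` and the choice of the "Set2" palette are illustrative.

```r
library(ggplot2)

# bars are colored through `fill`, so use scale_fill_brewer()
p_fill <- ggplot(
  mtcars,
  aes(
    x = factor(cyl),
    fill = factor(cyl)
  )
) +
  geom_bar() +
  # `name` sets the legend title
  scale_fill_brewer(palette = "Set2", name = "cylinders") +
  labs(x = "cylinders")

p_fill
```

Using scale_color_brewer() here would leave the bar fill unchanged, because each scale function only affects its own aesthetic.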

Set up a points plot

diamonds_subset <- sample_n(diamonds, 1000)

p <- ggplot(
  diamonds_subset,
  aes(
    x = carat,
    y = price,
    color = cut
  )
) +
  geom_point(alpha = 0.8) +
  theme_cowplot()

p + geom_line()

How to combine multiple plots into a figure?

# `plot_grid()` is from `cowplot`
plot_grid(
  p, p, p, p,
  labels = c(
    "A", "B",
    "C", "D"
  ),
  nrow = 2
)

We have 4 legends - can they be condensed?

Yes, but it is not exactly straightforward.

# fetch the legend for `p`
legend <- get_legend(
  p + theme(legend.position = "bottom")
)

p <- p + theme(legend.position = "none")

# first `plot_grid` builds the panels
panels <- plot_grid(
  p, p, p, p,
  labels = c(
    "A", "B", "C", "D"
  ),
  nrow = 2
)

# second `plot_grid` adds the legend to the panels
plot_grid(
  panels,
  legend,
  ncol = 1,
  rel_heights = c(1, .1)
)

Saving plots (Exercise 18)

Saves the last plot as a 5 x 5 inch file named plot_final.png in the img/ directory of the project (the path is built with here()).

ggsave() matches the file type to the file extension.

# default is to save last plot in the buffer
# can also specify with the `plot` argument
ggsave(
  here("img/plot_final.png"),
  width = 5,
  height = 5
)
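A sketch of the `plot` argument mentioned in the comment above, which saves a specific plot object instead of whichever plot was drawn last. The plot `p_save` and the temporary output path are hypothetical and chosen so the example runs anywhere.

```r
library(ggplot2)

p_save <- ggplot(mtcars, aes(x = mpg, y = hp)) +
  geom_point()

# hypothetical output path; a temporary directory keeps the example self-contained
out_file <- file.path(tempdir(), "plot_final.pdf")

# name the plot explicitly rather than relying on the last plot drawn;
# the .pdf extension selects the PDF device
ggsave(
  out_file,
  plot = p_save,
  width = 5,
  height = 5,
  units = "in"
)

# confirm the file was written
file.exists(out_file)
```

Changing the extension, for example to .png, is enough to switch the output format.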