R Bootcamp Problem Set 3

Author

Insert your name here

Published

October 21, 2024

Setup

Start by loading the libraries you need for your analysis in the code chunk below. When in doubt, start by loading the tidyverse package.

library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.3     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(here)

here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950

Problem Set

Each problem below is worth 5 points.

Use the data files in the data/ directory to answer the questions.

For this problem set, you are allowed to help each other, but you are not allowed to post correct answers in Slack.

The problem set is due 12pm on Aug 31.

Grading rubric

Everything is good: 5 points
Partially correct answers: 3-4 points
Reasonable attempt: 2 points

Question 1

Load the palmerpenguins package. Inspect the penguins tibble with summary.

Use drop_na() to remove rows with NA values in the penguins tibble. How many rows were removed from the tibble?

library(palmerpenguins)

summary(penguins)

      species          island    bill_length_mm  bill_depth_mm  
 Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
 Chinstrap: 68   Dream    :124   1st Qu.:39.23   1st Qu.:15.60  
 Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
                                 Mean   :43.92   Mean   :17.15  
                                 3rd Qu.:48.50   3rd Qu.:18.70  
                                 Max.   :59.60   Max.   :21.50  
                                 NA's   :2       NA's   :2      
 flipper_length_mm  body_mass_g       sex           year     
 Min.   :172.0     Min.   :2700   female:165   Min.   :2007  
 1st Qu.:190.0     1st Qu.:3550   male  :168   1st Qu.:2007  
 Median :197.0     Median :4050   NA's  : 11   Median :2008  
 Mean   :200.9     Mean   :4202                Mean   :2008  
 3rd Qu.:213.0     3rd Qu.:4750                3rd Qu.:2009  
 Max.   :231.0     Max.   :6300                Max.   :2009  
 NA's   :2         NA's   :2

penguins_nona <- drop_na(penguins)
nrow(penguins) - nrow(penguins_nona)

[1] 11
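The removed-row count can be cross-checked without drop_na(). The sketch below is a minimal illustration on a small hypothetical tibble (toy and its columns are assumptions, not part of the problem set data), comparing drop_na() against base R's complete.cases().

```r
library(tibble)
library(tidyr)

# Hypothetical toy data, not from the problem set
toy <- tibble(
  x = c(1, NA, 3, 4),
  y = c("a", "b", NA, "d")
)

# drop_na() keeps only rows that have no NA in any column
n_removed <- nrow(toy) - nrow(drop_na(toy))

# complete.cases() is TRUE for rows without any NA,
# so counting the FALSE values gives the number of incomplete rows
n_incomplete <- sum(!complete.cases(toy))

n_removed == n_incomplete
```

The same logic explains why only 11 penguin rows are removed even though the per-column NA counts in the summary add up to more: the number of removed rows equals the number of missing sex values, so the 2 rows with missing measurements must be among the 11 rows with a missing sex.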

Then, use replace_na() to replace NA values in bill_length_mm and bill_depth_mm with a value of 0.

replace_na(penguins, list(bill_length_mm = 0, bill_depth_mm = 0))

# A tibble: 344 x 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <fct>   <fct>              <dbl>         <dbl>             <int>       <int>
 1 Adelie  Torgersen           39.1          18.7               181        3750
 2 Adelie  Torgersen           39.5          17.4               186        3800
 3 Adelie  Torgersen           40.3          18                 195        3250
 4 Adelie  Torgersen            0             0                  NA          NA
 5 Adelie  Torgersen           36.7          19.3               193        3450
 6 Adelie  Torgersen           39.3          20.6               190        3650
 7 Adelie  Torgersen           38.9          17.8               181        3625
 8 Adelie  Torgersen           39.2          19.6               195        4675
 9 Adelie  Torgersen           34.1          18.1               193        3475
10 Adelie  Torgersen           42            20.2               190        4250
# i 334 more rows
# i 2 more variables: sex <fct>, year <int>
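Row 4 above still shows NA for flipper_length_mm and body_mass_g because replace_na() only fills the columns named in its list. A minimal sketch on hypothetical toy data (the tibble toy and its columns a and b are assumptions for illustration):

```r
library(tibble)
library(tidyr)

# Hypothetical toy data, not from the problem set
toy <- tibble(
  a = c(1, NA),
  b = c(NA, 2L)
)

# Only column a is named in the list, so only its NA is replaced
filled <- replace_na(toy, list(a = 0))

filled$a  # NA replaced by 0
filled$b  # NA left untouched

# replace_na() returns a new tibble; toy itself is unchanged
toy$a
```

As with the penguins call above, the result has to be assigned to a name if the filled tibble is needed later.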

Question 2

Use arrange, filter, and select on a data frame. Do the following, in order:

Import the data set data/data_transcript_exp_tidy.csv.gz.
Sort the tibble by expression data (count) from highest to lowest level.
Filter the tibble to keep rows where count > 100.
Select all columns except type.

exp_tbl <- read_csv(here("data/data_transcript_exp_tidy.csv.gz"))

Rows: 600 Columns: 5
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (4): ensembl_transcript_id, type, time, replicate
dbl (1): count

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

exp_tbl |>
  arrange(desc(count)) |>  # desc() sorts from highest to lowest
  filter(count > 100) |>
  select(-type)

# A tibble: 109 x 4
   ensembl_transcript_id      time  replicate count
   <chr>                      <chr> <chr>     <dbl>
 1 ENST00000342753.8_291_1314 0h    rep2       101 
 2 ENST00000378251.2_29_1778  0h    rep3       102 
 3 ENST00000378230.7_524_3101 0h    rep3       105 
 4 ENST00000054666.10_116_416 14h   rep3       105 
 5 ENST00000344843.11_97_544  14h   rep1       106 
 6 ENST00000400809.7_379_1567 14h   rep3       106.
 7 ENST00000054666.10_116_416 14h   rep1       108 
 8 ENST00000615252.4_548_1268 14h   rep3       108.
 9 ENST00000445648.5_40_1390  0h    rep1       109.
10 ENST00000291386.3_370_895  14h   rep2       109 
# i 99 more rows
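arrange() sorts in ascending order by default; wrapping the column in desc() reverses the order. The sketch below contrasts the two on hypothetical toy data (toy, id, and the count values are assumptions, not taken from the data file):

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not from the problem set
toy <- tibble(
  id = c("t1", "t2", "t3", "t4"),
  count = c(50, 300, 120, 101)
)

# Default: lowest count first
asc <- toy |>
  arrange(count) |>
  filter(count > 100)

# desc(): highest count first
dsc <- toy |>
  arrange(desc(count)) |>
  filter(count > 100)

asc$count
dsc$count
```

filter() preserves row order, so sorting before or after filtering gives the same result; sorting first simply follows the order of steps given in the question.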

Question 3

How will you:

create a new column log10count that contains log10 transformed count values and
rearrange the columns in the following order: ensembl_transcript_id, type, time, replicate, count, log10count.

(Note that the extra column has been dropped.)

Hint: Use mutate and select

exp_tbl |>
  mutate(log10count = log10(count)) |>
  select(ensembl_transcript_id, type, time, replicate, count, log10count)

# A tibble: 600 x 6
   ensembl_transcript_id      type  time  replicate count log10count
   <chr>                      <chr> <chr> <chr>     <dbl>      <dbl>
 1 ENST00000327044.6_51_2298  rna   0h    rep1        243      2.39 
 2 ENST00000327044.6_51_2298  rna   0h    rep2        322      2.51 
 3 ENST00000327044.6_51_2298  rna   0h    rep3        303      2.48 
 4 ENST00000327044.6_51_2298  rna   14h   rep1        177      2.25 
 5 ENST00000327044.6_51_2298  rna   14h   rep2        177      2.25 
 6 ENST00000327044.6_51_2298  rna   14h   rep3        239      2.38 
 7 ENST00000338591.7_360_2034 rna   0h    rep1         19      1.28 
 8 ENST00000338591.7_360_2034 rna   0h    rep2         17      1.23 
 9 ENST00000338591.7_360_2034 rna   0h    rep3         15      1.18 
10 ENST00000338591.7_360_2034 rna   14h   rep1          9      0.954
# i 590 more rows
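Two details of this step are worth a quick check: mutate() appends new columns at the end by default, and log10() of a zero count is -Inf. The sketch below uses hypothetical toy data; the pseudocount column log10count_pc is an assumption added for illustration and is not something the problem set asks for.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data, not from the problem set
toy <- tibble(
  id = c("t1", "t2", "t3"),
  count = c(100, 10, 0)
)

out <- toy |>
  mutate(
    log10count = log10(count),        # log10(0) is -Inf
    log10count_pc = log10(count + 1)  # adding 1 before the log keeps zeros finite
  )

names(out)      # new columns are appended after the existing ones
out$log10count
```

Because the new column already lands last, select() in the answer above mainly serves to state the column order explicitly.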

Question 4

Calculate a per-transcript sum while keeping the time information.

Hint: Use group_by with multiple variables, and summarise the “count” values using sum()

exp_tbl |>
  group_by(ensembl_transcript_id, time) |>
  summarize(count_sum = sum(count))

`summarise()` has grouped output by 'ensembl_transcript_id'. You can override
using the `.groups` argument.

# A tibble: 200 x 3
# Groups:   ensembl_transcript_id [100]
   ensembl_transcript_id        time  count_sum
   <chr>                        <chr>     <dbl>
 1 ENST00000054650.8_159_876    0h         33.8
 2 ENST00000054650.8_159_876    14h        16.5
 3 ENST00000054666.10_116_416   0h        447  
 4 ENST00000054666.10_116_416   14h       281  
 5 ENST00000054668.5_220_418    0h          0  
 6 ENST00000054668.5_220_418    14h        22.5
 7 ENST00000234590.8_121_1423   0h      31565  
 8 ENST00000234590.8_121_1423   14h     16394  
 9 ENST00000263741.11_1328_1496 0h         97.5
10 ENST00000263741.11_1328_1496 14h        79  
# i 190 more rows