Problem Set 2 Key

Published

October 21, 2024

library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.3     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(here)

here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950

Problem Set

Each problem below is worth 5 points.

Use the data files in the data/ directory to answer the questions.

For this problem set, you may help each other, but you may not post correct answers in Slack.

The problem set is due 12pm on Aug 30.

Grading rubric

Everything is good: 5 points
Partially correct answers: 3-4 points
Reasonable attempt: 2 points

Question 1

Import the dataset data_transcript_exp_subset using the readr package.

Hint: The file is located at the following path data/data_transcript_exp_subset.csv.gz

x <- read_csv(here("data/data_transcript_exp_subset.csv.gz"))

Rows: 100 Columns: 7
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (1): ensembl_transcript_id
dbl (6): rna_0h_rep1, rna_0h_rep2, rna_0h_rep3, rna_14h_rep1, rna_14h_rep2, ...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
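The message above suggests supplying the column types explicitly. A minimal sketch of that, assuming a toy stand-in file built from the first two rows and first three columns of the data (the names `toy`, `tmp`, and `x_toy` are illustrative and not part of the problem set):

```r
library(readr)
library(tibble)

# Toy stand-in for the real file: two rows and three columns taken from the data
toy <- tibble(
  ensembl_transcript_id = c(
    "ENST00000327044.6_51_2298",
    "ENST00000338591.7_360_2034"
  ),
  rna_0h_rep1 = c(243, 19),
  rna_0h_rep2 = c(322, 17)
)

# readr compresses on write and decompresses on read based on the .gz extension
tmp <- tempfile(fileext = ".csv.gz")
write_csv(toy, tmp)

# Declaring the column types up front avoids the column specification message
x_toy <- read_csv(
  tmp,
  col_types = cols(
    ensembl_transcript_id = col_character(),
    .default = col_double()
  )
)

x_toy
```

If you only want to silence the message without declaring types, `show_col_types = FALSE` does that, as the message itself notes.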

Question 2

Explore the dataset. Is this dataset tidy? If not, why not?

This data frame is a subset (100 rows) of transcript-level gene expression data in which transcript abundance was measured in triplicate at two time points of a treatment. The column names have the format molecule_time_replicate.

First, explore the structure of the dataset using some of the functions we learned in class.

x

# A tibble: 100 x 7
   ensembl_transcript_id        rna_0h_rep1 rna_0h_rep2 rna_0h_rep3 rna_14h_rep1
   <chr>                              <dbl>       <dbl>       <dbl>        <dbl>
 1 ENST00000327044.6_51_2298          243         322         303         177   
 2 ENST00000338591.7_360_2034          19          17          15           9   
 3 ENST00000379389.4_176_647           45          53          48          11   
 4 ENST00000379370.6_1158_6186         42          50          52          32   
 5 ENST00000379339.5_212_1352          17          19          25           3   
 6 ENST00000263741.11_1328_1496        27.5        33.7        36.3        22.5 
 7 ENST00000360001.10_285_1350        158         170.        171.        121   
 8 ENST00000263741.11_315_1338        148.        162.        158.        116.  
 9 ENST00000379198.3_138_1002          11          21          23           6   
10 ENST00000347370.6_475_1096          27.3        23.8        28.5         7.33
# i 90 more rows
# i 2 more variables: rna_14h_rep2 <dbl>, rna_14h_rep3 <dbl>

summary(x)

 ensembl_transcript_id  rna_0h_rep1       rna_0h_rep2        rna_0h_rep3      
 Length:100            Min.   :   0.00   Min.   :    0.00   Min.   :    0.00  
 Class :character      1st Qu.:  10.70   1st Qu.:   11.88   1st Qu.:   11.12  
 Mode  :character      Median :  27.41   Median :   31.05   Median :   31.91  
                       Mean   : 173.31   Mean   :  196.08   Mean   :  186.10  
                       3rd Qu.:  87.08   3rd Qu.:  105.00   3rd Qu.:   88.33  
                       Max.   :9802.00   Max.   :11144.00   Max.   :10619.00  
  rna_14h_rep1       rna_14h_rep2       rna_14h_rep3     
 Min.   :   0.000   Min.   :   0.000   Min.   :   0.000  
 1st Qu.:   3.875   1st Qu.:   3.962   1st Qu.:   5.000  
 Median :  10.435   Median :   9.665   Median :   9.665  
 Mean   : 102.875   Mean   :  93.370   Mean   : 111.515  
 3rd Qu.:  41.000   3rd Qu.:  38.750   3rd Qu.:  48.750  
 Max.   :5292.000   Max.   :5090.000   Max.   :6012.000

glimpse(x)

Rows: 100
Columns: 7
$ ensembl_transcript_id <chr> "ENST00000327044.6_51_2298", "ENST00000338591.7_~
$ rna_0h_rep1           <dbl> 243.00, 19.00, 45.00, 42.00, 17.00, 27.50, 158.0~
$ rna_0h_rep2           <dbl> 322.00, 17.00, 53.00, 50.00, 19.00, 33.67, 169.6~
$ rna_0h_rep3           <dbl> 303.00, 15.00, 48.00, 52.00, 25.00, 36.33, 171.3~
$ rna_14h_rep1          <dbl> 177.00, 9.00, 11.00, 32.00, 3.00, 22.50, 121.00,~
$ rna_14h_rep2          <dbl> 177.00, 5.00, 5.00, 31.00, 0.00, 29.17, 124.17, ~
$ rna_14h_rep3          <dbl> 239.00, 8.00, 14.00, 30.00, 2.00, 27.33, 155.33,~

Comment on whether this dataset is tidy, and if not, list the reasons why. Hint: In a tidy data frame, every column represents a single variable and every row represents a single observation.

Answer

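It is not tidy. Each row holds six observations for a transcript rather than one, and the column names encode three variables (molecule, time point, and replicate) that should each be in their own column.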

Question 3

How will you reshape the data frame so that each row has only one experimental observation?

Hint: Use pivot_longer()

x |> pivot_longer(-ensembl_transcript_id)

# A tibble: 600 x 3
   ensembl_transcript_id      name         value
   <chr>                      <chr>        <dbl>
 1 ENST00000327044.6_51_2298  rna_0h_rep1    243
 2 ENST00000327044.6_51_2298  rna_0h_rep2    322
 3 ENST00000327044.6_51_2298  rna_0h_rep3    303
 4 ENST00000327044.6_51_2298  rna_14h_rep1   177
 5 ENST00000327044.6_51_2298  rna_14h_rep2   177
 6 ENST00000327044.6_51_2298  rna_14h_rep3   239
 7 ENST00000338591.7_360_2034 rna_0h_rep1     19
 8 ENST00000338591.7_360_2034 rna_0h_rep2     17
 9 ENST00000338591.7_360_2034 rna_0h_rep3     15
10 ENST00000338591.7_360_2034 rna_14h_rep1     9
# i 590 more rows
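By default, pivot_longer() calls the new columns `name` and `value`. As a sketch, the same reshape can be written with explicit column names. Here it runs on a two-row toy copy of the data, and the names `toy`, `toy_long`, `sample`, and `count` are illustrative choices, not part of the key:

```r
library(tidyr)
library(tibble)

# First two transcripts from the dataset, typed in by hand for illustration
toy <- tibble(
  ensembl_transcript_id = c(
    "ENST00000327044.6_51_2298",
    "ENST00000338591.7_360_2034"
  ),
  rna_0h_rep1 = c(243, 19),
  rna_0h_rep2 = c(322, 17),
  rna_0h_rep3 = c(303, 15),
  rna_14h_rep1 = c(177, 9),
  rna_14h_rep2 = c(177, 5),
  rna_14h_rep3 = c(239, 8)
)

# Pivot every column except the transcript ID, naming the outputs explicitly
toy_long <- toy |>
  pivot_longer(
    cols = -ensembl_transcript_id,
    names_to = "sample",
    values_to = "count"
  )

toy_long
```

Two transcripts times six measurement columns gives 12 rows, in the same way that 100 transcripts give the 600 rows shown above.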

Question 4

How will you modify the data frame so that multiple variables are not present in a single column?

Hint: Use separate()

x_tidy <- x |>
  pivot_longer(-ensembl_transcript_id) |>
  separate(name, into = c("mol", "time", "rep"), sep = "_")

x_tidy

# A tibble: 600 x 5
   ensembl_transcript_id      mol   time  rep   value
   <chr>                      <chr> <chr> <chr> <dbl>
 1 ENST00000327044.6_51_2298  rna   0h    rep1    243
 2 ENST00000327044.6_51_2298  rna   0h    rep2    322
 3 ENST00000327044.6_51_2298  rna   0h    rep3    303
 4 ENST00000327044.6_51_2298  rna   14h   rep1    177
 5 ENST00000327044.6_51_2298  rna   14h   rep2    177
 6 ENST00000327044.6_51_2298  rna   14h   rep3    239
 7 ENST00000338591.7_360_2034 rna   0h    rep1     19
 8 ENST00000338591.7_360_2034 rna   0h    rep2     17
 9 ENST00000338591.7_360_2034 rna   0h    rep3     15
10 ENST00000338591.7_360_2034 rna   14h   rep1      9
# i 590 more rows
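As an alternative sketch, pivot_longer() can split the column names while it reshapes, which folds the pivot and the separate() step into one call. It is shown on a two-row toy copy of the data; `toy` and `toy_tidy` are illustrative names:

```r
library(tidyr)
library(tibble)

# First two transcripts from the dataset, typed in by hand for illustration
toy <- tibble(
  ensembl_transcript_id = c(
    "ENST00000327044.6_51_2298",
    "ENST00000338591.7_360_2034"
  ),
  rna_0h_rep1 = c(243, 19),
  rna_0h_rep2 = c(322, 17),
  rna_0h_rep3 = c(303, 15),
  rna_14h_rep1 = c(177, 9),
  rna_14h_rep2 = c(177, 5),
  rna_14h_rep3 = c(239, 8)
)

# names_sep splits each pivoted column name on "_" into three new columns.
# The ID column is excluded from the pivot, so its underscores are untouched.
toy_tidy <- toy |>
  pivot_longer(
    cols = -ensembl_transcript_id,
    names_to = c("mol", "time", "rep"),
    names_sep = "_",
    values_to = "value"
  )

toy_tidy
```

This should give the same five columns as the two-step version above. The two-step version keeps the reshape and the split visible as separate operations, which is the point of Questions 3 and 4.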

Question 5

How will you save your output as a TSV file?

Hint: Use the readr cheatsheet to figure this out.

https://rstudio.cloud/learn/cheat-sheets

write_tsv(x_tidy, "transcripts.tidy.tsv")
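To confirm that a file really is tab-separated, one option is a quick round trip: write it with write_tsv(), look at the header line, and read it back with read_tsv(). The sketch below uses a temporary file and a two-row toy table; `toy_tidy`, `out`, `header`, and `back` are illustrative names:

```r
library(readr)
library(tibble)

# Two tidy rows taken from the dataset, for illustration only
toy_tidy <- tibble(
  ensembl_transcript_id = "ENST00000327044.6_51_2298",
  mol = "rna",
  time = c("0h", "14h"),
  rep = "rep1",
  value = c(243, 177)
)

# Write to a temporary .tsv file so the working directory stays clean
out <- tempfile(fileext = ".tsv")
write_tsv(toy_tidy, out)

# The header line should contain tab characters between the column names
header <- readLines(out, n = 1)
header

# Reading the file back should recover the same columns and values
back <- read_tsv(out, show_col_types = FALSE)
back
```

write_csv() always writes comma-separated output regardless of the file extension, so the function you call decides the format, not the file name.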