Problem Set 2 Key

Author

JH

Published

September 6, 2025

Problem Set

Each problem below is worth 4 points.

Use the data files in the data/ directory to answer the questions.

For this problem set, you may help each other, but do not post correct answers in Slack.

The problem set is due 5pm on Aug 27.

Grading rubric

  • Everything is good: 5 points
  • Partially correct answers: 3-4 points
  • Reasonable attempt: 2 points

Question 1

Start by loading the libraries you need for the analysis below. When in doubt, start by loading the tidyverse package. You should also load the here package.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950

Now import the dataset data_transcript_exp_subset using the readr package. Use read_csv() to import the file.

The file is located at data/bootcamp/data_transcript_exp_subset.csv.gz. Use here() to build the complete path.

x <- read_csv(here("data/bootcamp/data_transcript_exp_subset.csv.gz"))
Rows: 100 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ensembl_transcript_id
dbl (6): rna_0h_rep1, rna_0h_rep2, rna_0h_rep3, rna_14h_rep1, rna_14h_rep2, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
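The column-specification message above can be avoided by declaring the column types in the call. The following is a minimal sketch, not the course's official solution: `toy`, `toy_path`, and `toy_in` are hypothetical names, and the data are a tiny stand-in written to a temporary file so the example runs anywhere. With the real file, the path would come from here() as shown above.

```r
library(readr)

# Hypothetical toy data standing in for the real expression file
toy <- data.frame(
  ensembl_transcript_id = c("tx1", "tx2"),
  rna_0h_rep1 = c(243, 19),
  rna_14h_rep1 = c(177, 9)
)

# readr compresses on write and decompresses on read when the path ends in .gz
toy_path <- tempfile(fileext = ".csv.gz")
write_csv(toy, toy_path)

# Declare the types up front: one character ID column, everything else double
toy_in <- read_csv(
  toy_path,
  col_types = cols(
    ensembl_transcript_id = col_character(),
    .default = col_double()
  )
)

toy_in
```

Declaring the types also makes the import fail loudly if a measurement column unexpectedly contains text.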

Question 2

Explore the dataset. Is this dataset tidy? If not, why not?

This data frame is a subset (100 rows) of transcript-level gene expression data in which transcript abundance was measured in triplicate at two time points of a treatment. The column names have the format molecule_time_replicate.

First, explore the structure of the dataset using some of the functions we learned in class. Try using glimpse(), summary(), and names() to understand the data structure.

x
# A tibble: 100 × 7
   ensembl_transcript_id        rna_0h_rep1 rna_0h_rep2 rna_0h_rep3 rna_14h_rep1
   <chr>                              <dbl>       <dbl>       <dbl>        <dbl>
 1 ENST00000327044.6_51_2298          243         322         303         177   
 2 ENST00000338591.7_360_2034          19          17          15           9   
 3 ENST00000379389.4_176_647           45          53          48          11   
 4 ENST00000379370.6_1158_6186         42          50          52          32   
 5 ENST00000379339.5_212_1352          17          19          25           3   
 6 ENST00000263741.11_1328_1496        27.5        33.7        36.3        22.5 
 7 ENST00000360001.10_285_1350        158         170.        171.        121   
 8 ENST00000263741.11_315_1338        148.        162.        158.        116.  
 9 ENST00000379198.3_138_1002          11          21          23           6   
10 ENST00000347370.6_475_1096          27.3        23.8        28.5         7.33
# ℹ 90 more rows
# ℹ 2 more variables: rna_14h_rep2 <dbl>, rna_14h_rep3 <dbl>
# Let's explore the data more systematically
# Look at the column names
names(x)
[1] "ensembl_transcript_id" "rna_0h_rep1"           "rna_0h_rep2"          
[4] "rna_0h_rep3"           "rna_14h_rep1"          "rna_14h_rep2"         
[7] "rna_14h_rep3"         
# Summarize the distribution of each column
summary(x)
 ensembl_transcript_id  rna_0h_rep1       rna_0h_rep2        rna_0h_rep3      
 Length:100            Min.   :   0.00   Min.   :    0.00   Min.   :    0.00  
 Class :character      1st Qu.:  10.70   1st Qu.:   11.88   1st Qu.:   11.12  
 Mode  :character      Median :  27.41   Median :   31.05   Median :   31.91  
                       Mean   : 173.31   Mean   :  196.08   Mean   :  186.10  
                       3rd Qu.:  87.08   3rd Qu.:  105.00   3rd Qu.:   88.33  
                       Max.   :9802.00   Max.   :11144.00   Max.   :10619.00  
  rna_14h_rep1       rna_14h_rep2       rna_14h_rep3     
 Min.   :   0.000   Min.   :   0.000   Min.   :   0.000  
 1st Qu.:   3.875   1st Qu.:   3.962   1st Qu.:   5.000  
 Median :  10.435   Median :   9.665   Median :   9.665  
 Mean   : 102.875   Mean   :  93.370   Mean   : 111.515  
 3rd Qu.:  41.000   3rd Qu.:  38.750   3rd Qu.:  48.750  
 Max.   :5292.000   Max.   :5090.000   Max.   :6012.000  
# Show every column with its type and first few values
glimpse(x)
Rows: 100
Columns: 7
$ ensembl_transcript_id <chr> "ENST00000327044.6_51_2298", "ENST00000338591.7_…
$ rna_0h_rep1           <dbl> 243.00, 19.00, 45.00, 42.00, 17.00, 27.50, 158.0…
$ rna_0h_rep2           <dbl> 322.00, 17.00, 53.00, 50.00, 19.00, 33.67, 169.6…
$ rna_0h_rep3           <dbl> 303.00, 15.00, 48.00, 52.00, 25.00, 36.33, 171.3…
$ rna_14h_rep1          <dbl> 177.00, 9.00, 11.00, 32.00, 3.00, 22.50, 121.00,…
$ rna_14h_rep2          <dbl> 177.00, 5.00, 5.00, 31.00, 0.00, 29.17, 124.17, …
$ rna_14h_rep3          <dbl> 239.00, 8.00, 14.00, 30.00, 2.00, 27.33, 155.33,…
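One way to see which variables are packed into the column names is to split them on the underscore. This is a small sketch under the assumption that every measurement column follows the molecule_time_replicate pattern; `measure_cols` and `parts` are hypothetical names, and the column names are copied from the names(x) output above.

```r
library(stringr)

# Measurement column names, copied from the names(x) output
measure_cols <- c(
  "rna_0h_rep1", "rna_0h_rep2", "rna_0h_rep3",
  "rna_14h_rep1", "rna_14h_rep2", "rna_14h_rep3"
)

# Split each name into exactly three pieces at the underscores
parts <- str_split_fixed(measure_cols, "_", n = 3)
colnames(parts) <- c("molecule", "timepoint", "replicate")

parts

# Distinct values hidden inside the column names
unique(parts[, "timepoint"])
unique(parts[, "replicate"])
```

Each of the three pieces is a variable in its own right, which is the core of the tidiness problem discussed next.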

Comment on whether this dataset is tidy, and if not, list the reasons why.

Hint: In a tidy data frame, every column represents a single variable and every row represents a single observation.

Answer

It is not tidy because:

  1. The time points and replicates are not in their own columns.
  2. Multiple variables (molecule type, time, replicate) are encoded in the column names.
  3. Each row contains multiple observations (different time points and replicates).

Question 3

How will you reshape the data frame so that each row has only one experimental observation?

Before we reshape, let’s think about what we want:

  • Which column should stay the same? (The transcript ID)
  • Which columns contain the measurements? (All the others)
  • What should we call the new column names?

Use pivot_longer() to reshape the data. You’ll want to:

  • Keep the ensembl_transcript_id column as-is (use cols = -ensembl_transcript_id)
  • Create a new column for the condition names (use names_to = "condition")
  • Create a new column for the values (use values_to = "count")
# Reshape the data so each row is one observation
x_long <-
  pivot_longer(
    x,
    cols = -ensembl_transcript_id, # everything except the ID column
    names_to = "condition", # new column for the condition names
    values_to = "count" # new column for the count values
  )

x_long
# A tibble: 600 × 3
   ensembl_transcript_id      condition    count
   <chr>                      <chr>        <dbl>
 1 ENST00000327044.6_51_2298  rna_0h_rep1    243
 2 ENST00000327044.6_51_2298  rna_0h_rep2    322
 3 ENST00000327044.6_51_2298  rna_0h_rep3    303
 4 ENST00000327044.6_51_2298  rna_14h_rep1   177
 5 ENST00000327044.6_51_2298  rna_14h_rep2   177
 6 ENST00000327044.6_51_2298  rna_14h_rep3   239
 7 ENST00000338591.7_360_2034 rna_0h_rep1     19
 8 ENST00000338591.7_360_2034 rna_0h_rep2     17
 9 ENST00000338591.7_360_2034 rna_0h_rep3     15
10 ENST00000338591.7_360_2034 rna_14h_rep1     9
# ℹ 590 more rows
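A quick sanity check after reshaping is that the long table has one row per transcript per measurement column; for the real data that is 100 × 6 = 600 rows, matching the output above. The sketch below shows the same check on a hypothetical toy tibble (`toy` and `toy_long` are illustrative names, not part of the problem set).

```r
library(tibble)
library(tidyr)
library(dplyr)

# Hypothetical toy data: 2 transcripts, 3 measurement columns
toy <- tibble(
  ensembl_transcript_id = c("tx1", "tx2"),
  rna_0h_rep1 = c(243, 19),
  rna_0h_rep2 = c(322, 17),
  rna_14h_rep1 = c(177, 9)
)

toy_long <- pivot_longer(
  toy,
  cols = -ensembl_transcript_id,
  names_to = "condition",
  values_to = "count"
)

# Expected rows = transcripts x measurement columns
stopifnot(nrow(toy_long) == nrow(toy) * (ncol(toy) - 1))

# Each condition should appear once per transcript
count(toy_long, condition)
```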

Question 4

How will you modify the dataframe so that multiple variables are not present in a single column?

Use separate_wider_delim() to split the condition column into separate variables. You need to:

  • Specify which column to separate (condition)
  • Specify the delimiter character (delim = "_")
  • Provide the new column names (names = c("molecule", "timepoint", "replicate"))
x_tidy <-
  separate_wider_delim(
    x_long,
    condition,
    delim = "_",
    names = c("molecule", "timepoint", "replicate")
  )

x_tidy
# A tibble: 600 × 5
   ensembl_transcript_id      molecule timepoint replicate count
   <chr>                      <chr>    <chr>     <chr>     <dbl>
 1 ENST00000327044.6_51_2298  rna      0h        rep1        243
 2 ENST00000327044.6_51_2298  rna      0h        rep2        322
 3 ENST00000327044.6_51_2298  rna      0h        rep3        303
 4 ENST00000327044.6_51_2298  rna      14h       rep1        177
 5 ENST00000327044.6_51_2298  rna      14h       rep2        177
 6 ENST00000327044.6_51_2298  rna      14h       rep3        239
 7 ENST00000338591.7_360_2034 rna      0h        rep1         19
 8 ENST00000338591.7_360_2034 rna      0h        rep2         17
 9 ENST00000338591.7_360_2034 rna      0h        rep3         15
10 ENST00000338591.7_360_2034 rna      14h       rep1          9
# ℹ 590 more rows
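As an alternative to the two-step approach in Questions 3 and 4, pivot_longer() can split the column names while reshaping, by passing several names to names_to together with names_sep. The sketch below uses a hypothetical toy tibble (`toy`, `toy_tidy`) and assumes every measurement column has exactly three underscore-separated parts.

```r
library(tibble)
library(tidyr)

# Hypothetical toy data with the same naming scheme as the real columns
toy <- tibble(
  ensembl_transcript_id = c("tx1", "tx2"),
  rna_0h_rep1 = c(243, 19),
  rna_0h_rep2 = c(322, 17),
  rna_14h_rep1 = c(177, 9)
)

# Reshape and split the names in one call
toy_tidy <- pivot_longer(
  toy,
  cols = -ensembl_transcript_id,
  names_to = c("molecule", "timepoint", "replicate"),
  names_sep = "_",
  values_to = "count"
)

toy_tidy
```

The two-step version is easier to debug because the intermediate condition column can be inspected; the one-step version is more compact once the naming pattern is known to be consistent.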

Question 5

How will you save your output as a TSV file?

Use write_tsv() from the readr package to save your tidy data. Provide the data object and a filename.

Hint: Use the readr cheatsheet at the bottom of this page to figure this out.

After running your new code, you should have a new file called transcripts.tidy.tsv in your working directory.

write_tsv(x_tidy, "transcripts.tidy.tsv")
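To confirm that the file was written as intended, it can be read back and compared with the original. This is a minimal sketch using a hypothetical toy tibble and a temporary directory, so it does not touch the working directory; `toy_tidy`, `out_path`, and `toy_back` are illustrative names.

```r
library(readr)
library(tibble)

# Hypothetical tidy data standing in for x_tidy
toy_tidy <- tibble(
  ensembl_transcript_id = c("tx1", "tx1"),
  molecule = c("rna", "rna"),
  timepoint = c("0h", "14h"),
  replicate = c("rep1", "rep1"),
  count = c(243, 177)
)

# Write to a temporary location rather than the working directory
out_path <- file.path(tempdir(), "toy.tidy.tsv")
write_tsv(toy_tidy, out_path)

# Read the file back and check that names and values survived the round trip
toy_back <- read_tsv(out_path, show_col_types = FALSE)

file.exists(out_path)
identical(names(toy_back), names(toy_tidy))
identical(toy_back$count, toy_tidy$count)
```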

Question 6

Can you reverse the process? How would you go from tidy back to wide format?

Use pivot_wider() to go from the tidy format back to the original wide format. You need to:

  • Specify where the new column names come from (names_from = c(molecule, timepoint, replicate))
  • Specify where the values come from (values_from = count)
  • Specify how to combine the names (names_sep = "_")
# Going back to wide format
pivot_wider(
  x_tidy,
  names_from = c(molecule, timepoint, replicate),
  values_from = count,
  names_sep = "_"
)
# A tibble: 100 × 7
   ensembl_transcript_id        rna_0h_rep1 rna_0h_rep2 rna_0h_rep3 rna_14h_rep1
   <chr>                              <dbl>       <dbl>       <dbl>        <dbl>
 1 ENST00000327044.6_51_2298          243         322         303         177   
 2 ENST00000338591.7_360_2034          19          17          15           9   
 3 ENST00000379389.4_176_647           45          53          48          11   
 4 ENST00000379370.6_1158_6186         42          50          52          32   
 5 ENST00000379339.5_212_1352          17          19          25           3   
 6 ENST00000263741.11_1328_1496        27.5        33.7        36.3        22.5 
 7 ENST00000360001.10_285_1350        158         170.        171.        121   
 8 ENST00000263741.11_315_1338        148.        162.        158.        116.  
 9 ENST00000379198.3_138_1002          11          21          23           6   
10 ENST00000347370.6_475_1096          27.3        23.8        28.5         7.33
# ℹ 90 more rows
# ℹ 2 more variables: rna_14h_rep2 <dbl>, rna_14h_rep3 <dbl>
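The printed output looks like the original table, and the round trip can also be checked programmatically. The sketch below does this on a hypothetical toy tibble (`toy`, `toy_tidy`, `toy_wide`); comparing the columns as plain lists is one simple choice among several, picked here to sidestep differences in data frame attributes.

```r
library(tibble)
library(tidyr)

# Hypothetical wide data with the same naming scheme as the real columns
toy <- tibble(
  ensembl_transcript_id = c("tx1", "tx2"),
  rna_0h_rep1 = c(243, 19),
  rna_0h_rep2 = c(322, 17),
  rna_14h_rep1 = c(177, 9)
)

# Wide -> tidy
toy_tidy <- pivot_longer(
  toy,
  cols = -ensembl_transcript_id,
  names_to = c("molecule", "timepoint", "replicate"),
  names_sep = "_",
  values_to = "count"
)

# Tidy -> wide again
toy_wide <- pivot_wider(
  toy_tidy,
  names_from = c(molecule, timepoint, replicate),
  values_from = count,
  names_sep = "_"
)

# Same column names, same order, same values
identical(as.list(toy_wide), as.list(toy))
```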