Problem Set
Each problem below is worth 4 points.
Use the data files in the data/ directory to answer the questions.
For this problem set, you may help each other, but you may not post correct answers in Slack.
The problem set is due 5pm on Aug 27.
Grading rubric
Everything is good: 5 points
Partially correct answers: 3-4 points
Reasonable attempt: 2 points
Question 1
Start by loading the libraries you need for analysis below. When in doubt, start by loading the tidyverse package. You should also load here.
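A minimal sketch of the loading step, assuming both packages are already installed; the startup messages it prints are the ones reproduced right after it.

```r
# Load the core tidyverse packages (readr, dplyr, tidyr, ggplot2, ...)
library(tidyverse)

# Load here, which builds file paths relative to the project root
library(here)
```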
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.2 ✔ tibble 3.2.1
✔ lubridate 1.9.4 ✔ tidyr 1.3.1
✔ purrr 1.0.4
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950
Now import the dataset data_transcript_exp_subset using the readr package. Use read_csv() to import the file.
The file is located at data/bootcamp/data_transcript_exp_subset.csv.gz; use here() to create the complete path.
x <- read_csv(here("data/bootcamp/data_transcript_exp_subset.csv.gz"))
Rows: 100 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): ensembl_transcript_id
dbl (6): rna_0h_rep1, rna_0h_rep2, rna_0h_rep3, rna_14h_rep1, rna_14h_rep2, ...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question 2
Explore the dataset. Is this dataset tidy? If not, why not?
This data frame is a subset (100 rows) of transcript-level gene expression data in which transcript abundance was measured, in triplicate, at two time points of a treatment. The column names have the format molecule_time_replicate.
First, explore the structure of the dataset using some of the functions we learned in class. Try using glimpse(), summary(), and names() to understand the data structure.
# A tibble: 100 × 7
ensembl_transcript_id rna_0h_rep1 rna_0h_rep2 rna_0h_rep3 rna_14h_rep1
<chr> <dbl> <dbl> <dbl> <dbl>
1 ENST00000327044.6_51_2298 243 322 303 177
2 ENST00000338591.7_360_2034 19 17 15 9
3 ENST00000379389.4_176_647 45 53 48 11
4 ENST00000379370.6_1158_6186 42 50 52 32
5 ENST00000379339.5_212_1352 17 19 25 3
6 ENST00000263741.11_1328_1496 27.5 33.7 36.3 22.5
7 ENST00000360001.10_285_1350 158 170. 171. 121
8 ENST00000263741.11_315_1338 148. 162. 158. 116.
9 ENST00000379198.3_138_1002 11 21 23 6
10 ENST00000347370.6_475_1096 27.3 23.8 28.5 7.33
# ℹ 90 more rows
# ℹ 2 more variables: rna_14h_rep2 <dbl>, rna_14h_rep3 <dbl>
# Let's explore the data more systematically
# Look at the column names
names(x)
[1] "ensembl_transcript_id" "rna_0h_rep1" "rna_0h_rep2"
[4] "rna_0h_rep3" "rna_14h_rep1" "rna_14h_rep2"
[7] "rna_14h_rep3"
ensembl_transcript_id rna_0h_rep1 rna_0h_rep2 rna_0h_rep3
Length:100 Min. : 0.00 Min. : 0.00 Min. : 0.00
Class :character 1st Qu.: 10.70 1st Qu.: 11.88 1st Qu.: 11.12
Mode :character Median : 27.41 Median : 31.05 Median : 31.91
Mean : 173.31 Mean : 196.08 Mean : 186.10
3rd Qu.: 87.08 3rd Qu.: 105.00 3rd Qu.: 88.33
Max. :9802.00 Max. :11144.00 Max. :10619.00
rna_14h_rep1 rna_14h_rep2 rna_14h_rep3
Min. : 0.000 Min. : 0.000 Min. : 0.000
1st Qu.: 3.875 1st Qu.: 3.962 1st Qu.: 5.000
Median : 10.435 Median : 9.665 Median : 9.665
Mean : 102.875 Mean : 93.370 Mean : 111.515
3rd Qu.: 41.000 3rd Qu.: 38.750 3rd Qu.: 48.750
Max. :5292.000 Max. :5090.000 Max. :6012.000
Rows: 100
Columns: 7
$ ensembl_transcript_id <chr> "ENST00000327044.6_51_2298", "ENST00000338591.7_…
$ rna_0h_rep1 <dbl> 243.00, 19.00, 45.00, 42.00, 17.00, 27.50, 158.0…
$ rna_0h_rep2 <dbl> 322.00, 17.00, 53.00, 50.00, 19.00, 33.67, 169.6…
$ rna_0h_rep3 <dbl> 303.00, 15.00, 48.00, 52.00, 25.00, 36.33, 171.3…
$ rna_14h_rep1 <dbl> 177.00, 9.00, 11.00, 32.00, 3.00, 22.50, 121.00,…
$ rna_14h_rep2 <dbl> 177.00, 5.00, 5.00, 31.00, 0.00, 29.17, 124.17, …
$ rna_14h_rep3 <dbl> 239.00, 8.00, 14.00, 30.00, 2.00, 27.33, 155.33,…
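The outputs above correspond to printing the tibble and then calling names(), summary(), and glimpse() on it. As a minimal sketch that runs without the data file, the calls are shown here on a two-row stand-in for x; the stand-in is an assumption for illustration, with values copied from the first two transcripts in the output above.

```r
library(tidyverse)

# Stand-in for x: the first two transcripts from the real table
x <- tribble(
  ~ensembl_transcript_id, ~rna_0h_rep1, ~rna_0h_rep2, ~rna_0h_rep3,
  ~rna_14h_rep1, ~rna_14h_rep2, ~rna_14h_rep3,
  "ENST00000327044.6_51_2298",  243, 322, 303, 177, 177, 239,
  "ENST00000338591.7_360_2034",  19,  17,  15,   9,   5,   8
)

x           # print the tibble: dimensions, column types, first rows
names(x)    # column names only
summary(x)  # per-column summary statistics
glimpse(x)  # one line per column, with type and leading values
```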
Comment on whether this dataset is tidy, and if not, list the reasons why.
Hint: In a tidy data frame, every column represents a single variable and every row represents a single observation.
Answer
It is not tidy because:
1. The time points and replicates are not in their own columns
2. Multiple variables (molecule type, time, replicate) are encoded in column names
3. Each row contains multiple observations (different time points and replicates)
Question 3
How will you reshape the data frame so that each row has only one experimental observation?
Before we reshape, let’s think about what we want:
Which column should stay the same? (The transcript ID)
Which columns contain the measurements? (All the others)
What should we call the new column names?
Use pivot_longer() to reshape the data. You’ll want to:
Keep the ensembl_transcript_id column as-is (use cols = -ensembl_transcript_id)
Create a new column for the condition names (use names_to = "condition")
Create a new column for the values (use values_to = "count")
# Reshape the data so each row is one observation
x_long <-
  pivot_longer(
    x,
    cols = -ensembl_transcript_id, # everything except the ID column
    names_to = "condition",        # new column for the condition names
    values_to = "count"            # new column for the count values
  )
x_long
# A tibble: 600 × 3
ensembl_transcript_id condition count
<chr> <chr> <dbl>
1 ENST00000327044.6_51_2298 rna_0h_rep1 243
2 ENST00000327044.6_51_2298 rna_0h_rep2 322
3 ENST00000327044.6_51_2298 rna_0h_rep3 303
4 ENST00000327044.6_51_2298 rna_14h_rep1 177
5 ENST00000327044.6_51_2298 rna_14h_rep2 177
6 ENST00000327044.6_51_2298 rna_14h_rep3 239
7 ENST00000338591.7_360_2034 rna_0h_rep1 19
8 ENST00000338591.7_360_2034 rna_0h_rep2 17
9 ENST00000338591.7_360_2034 rna_0h_rep3 15
10 ENST00000338591.7_360_2034 rna_14h_rep1 9
# ℹ 590 more rows
Question 4
How will you modify the dataframe so that multiple variables are not present in a single column?
Use separate_wider_delim() to split the condition column into separate variables. You need to:
Specify which column to separate (condition)
Specify the delimiter character (delim = "_")
Provide the new column names (names = c("molecule", "timepoint", "replicate"))
x_tidy <-
  separate_wider_delim(
    x_long,
    condition,
    delim = "_",
    names = c("molecule", "timepoint", "replicate")
  )
x_tidy
# A tibble: 600 × 5
ensembl_transcript_id molecule timepoint replicate count
<chr> <chr> <chr> <chr> <dbl>
1 ENST00000327044.6_51_2298 rna 0h rep1 243
2 ENST00000327044.6_51_2298 rna 0h rep2 322
3 ENST00000327044.6_51_2298 rna 0h rep3 303
4 ENST00000327044.6_51_2298 rna 14h rep1 177
5 ENST00000327044.6_51_2298 rna 14h rep2 177
6 ENST00000327044.6_51_2298 rna 14h rep3 239
7 ENST00000338591.7_360_2034 rna 0h rep1 19
8 ENST00000338591.7_360_2034 rna 0h rep2 17
9 ENST00000338591.7_360_2034 rna 0h rep3 15
10 ENST00000338591.7_360_2034 rna 14h rep1 9
# ℹ 590 more rows
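As an alternative to the two-step approach used in Questions 3 and 4, pivot_longer() can split the column names while it reshapes: give names_to several names and set names_sep. This is a sketch on a two-row stand-in for x (an assumption for illustration, with values copied from the table in Question 2), not the method the problem set asks for. It relies on every measurement column name having exactly two underscores.

```r
library(tidyverse)

# Stand-in for x: the first two transcripts from the real table
x <- tribble(
  ~ensembl_transcript_id, ~rna_0h_rep1, ~rna_0h_rep2, ~rna_0h_rep3,
  ~rna_14h_rep1, ~rna_14h_rep2, ~rna_14h_rep3,
  "ENST00000327044.6_51_2298",  243, 322, 303, 177, 177, 239,
  "ENST00000338591.7_360_2034",  19,  17,  15,   9,   5,   8
)

# Reshape and split the column names in a single call
x_tidy <-
  pivot_longer(
    x,
    cols = -ensembl_transcript_id,
    names_to = c("molecule", "timepoint", "replicate"), # one new column per name piece
    names_sep = "_",                                    # split the old names on "_"
    values_to = "count"
  )

x_tidy
```

Only the columns selected by cols are split, so the underscores inside the transcript IDs are left alone.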
Question 5
How will you save your output as a TSV file?
Use write_tsv() from the readr package to save your tidy data. Provide the data object and a filename.
Hint: Use the readr cheatsheet at the bottom of this page to figure this out.
After running your new code, you should have a new file called transcripts.tidy.tsv in your working directory.
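A minimal sketch of the save step. The small x_tidy built here is a stand-in so the example runs without the data file (rows copied from the tidy table in Question 4); with the real x_tidy, only the write_tsv() call is needed. Reading the file back is an optional check and not part of the question.

```r
library(tidyverse)

# Stand-in for x_tidy: a few rows from the tidy table
x_tidy <- tribble(
  ~ensembl_transcript_id,       ~molecule, ~timepoint, ~replicate, ~count,
  "ENST00000327044.6_51_2298",  "rna",     "0h",       "rep1",     243,
  "ENST00000327044.6_51_2298",  "rna",     "0h",       "rep2",     322,
  "ENST00000338591.7_360_2034", "rna",     "14h",      "rep1",       9
)

# Write a tab-separated file into the current working directory
write_tsv(x_tidy, "transcripts.tidy.tsv")

# Optional check: read the file back in
x_check <- read_tsv("transcripts.tidy.tsv", show_col_types = FALSE)
x_check
```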
Question 6
Can you reverse the process? How would you go from tidy back to wide format?
Use pivot_wider() to go from the tidy format back to the original wide format. You need to:
Specify where the new column names come from (names_from = c(molecule, timepoint, replicate))
Specify where the values come from (values_from = count)
Specify how to combine the names (names_sep = "_")
# Going back to wide format
pivot_wider(
  x_tidy,
  names_from = c(molecule, timepoint, replicate),
  values_from = count,
  names_sep = "_"
)
# A tibble: 100 × 7
ensembl_transcript_id rna_0h_rep1 rna_0h_rep2 rna_0h_rep3 rna_14h_rep1
<chr> <dbl> <dbl> <dbl> <dbl>
1 ENST00000327044.6_51_2298 243 322 303 177
2 ENST00000338591.7_360_2034 19 17 15 9
3 ENST00000379389.4_176_647 45 53 48 11
4 ENST00000379370.6_1158_6186 42 50 52 32
5 ENST00000379339.5_212_1352 17 19 25 3
6 ENST00000263741.11_1328_1496 27.5 33.7 36.3 22.5
7 ENST00000360001.10_285_1350 158 170. 171. 121
8 ENST00000263741.11_315_1338 148. 162. 158. 116.
9 ENST00000379198.3_138_1002 11 21 23 6
10 ENST00000347370.6_475_1096 27.3 23.8 28.5 7.33
# ℹ 90 more rows
# ℹ 2 more variables: rna_14h_rep2 <dbl>, rna_14h_rep3 <dbl>
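One way to confirm that the round trip recovers the original table is to compare the re-widened result with the starting data. The sketch below does this on a two-row stand-in for x (an assumption for illustration, with values copied from the table in Question 2); the name x_wide is introduced here and does not appear in the problem set.

```r
library(tidyverse)

# Stand-in for x: the first two transcripts from the real table
x <- tribble(
  ~ensembl_transcript_id, ~rna_0h_rep1, ~rna_0h_rep2, ~rna_0h_rep3,
  ~rna_14h_rep1, ~rna_14h_rep2, ~rna_14h_rep3,
  "ENST00000327044.6_51_2298",  243, 322, 303, 177, 177, 239,
  "ENST00000338591.7_360_2034",  19,  17,  15,   9,   5,   8
)

# Wide -> long -> tidy, as in Questions 3 and 4
x_tidy <- x |>
  pivot_longer(
    cols = -ensembl_transcript_id,
    names_to = "condition",
    values_to = "count"
  ) |>
  separate_wider_delim(
    condition,
    delim = "_",
    names = c("molecule", "timepoint", "replicate")
  )

# Tidy -> wide, as in Question 6
x_wide <- pivot_wider(
  x_tidy,
  names_from = c(molecule, timepoint, replicate),
  values_from = count,
  names_sep = "_"
)

# Compare as plain data frames; TRUE means names and values match
round_trip_ok <- isTRUE(all.equal(as.data.frame(x_wide), as.data.frame(x)))
round_trip_ok
```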