library(___)
library(___)
Problem Set 2 Key
Problem Set
Each problem below is worth 4 points.
Use the data files in the data/ directory to answer the questions.
For this problem set, you may help each other, but you may not post correct answers in Slack.
The problem set is due at 5 pm on Aug 27.
Grading rubric
- Everything is good: 5 points
- Partially correct answers: 3-4 points
- Reasonable attempt: 2 points
Question 1
Start by loading the libraries you need for analysis below. When in doubt, start by loading the tidyverse package. You should also load here.
Now import the dataset data_transcript_exp_subset using the readr package. Use read_csv() to import the file.
The file is located at data/data_transcript_exp_subset.csv.gz; use here() to create the complete path.
exp_tbl <- read_csv(___)
exp_tbl
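As a rough illustration of the import step, the sketch below writes a small made-up table to a temporary .csv.gz file and reads it back. The table `toy`, its column names, and the file name `toy.csv.gz` are hypothetical and are not part of the problem set data.

```r
library(readr)
library(here)

# Hypothetical toy table standing in for the real expression data.
toy <- tibble::tibble(
  id = c("tx1", "tx2"),
  rna_0h_rep1 = c(1.5, 2.0)
)

# readr compresses on write and decompresses on read based on the .gz extension.
toy_path <- file.path(tempdir(), "toy.csv.gz")
write_csv(toy, toy_path)
toy_in <- read_csv(toy_path, show_col_types = FALSE)

# here() appends path components to the project root.
# It only builds the path; it does not check that the file exists.
example_path <- here("data", "toy.csv.gz")
example_path
```

Building the path with here() rather than typing it out should make the code independent of the directory the document is rendered from.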
Question 2
Explore the dataset. Is this dataset tidy? If not, why not?
This data frame is a subset (100 lines) of transcript-level gene expression data in which transcript abundance was measured in triplicate at two time points of a treatment. The column names have the format molecule_time_replicate.
First, explore the structure of the dataset using some of the functions we learned in class. Try using glimpse(), summary(), and names() to understand the data structure.
Add more code chunks as needed to separate the different steps of your exploration.
x
names(x)
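To show what each exploration function reports, here is a minimal sketch on a made-up wide table. The object `toy_wide`, its values, and the time labels (0h, 14h) are assumptions chosen only to mimic the molecule_time_replicate naming pattern.

```r
library(dplyr)

# Hypothetical wide table: one row per transcript, one column per measurement.
toy_wide <- tibble::tibble(
  id = c("tx1", "tx2"),
  rna_0h_rep1 = c(1.5, 2.0),
  rna_0h_rep2 = c(1.7, 2.2),
  rna_14h_rep1 = c(3.1, 0.4)
)

glimpse(toy_wide)   # column types and the first few values of each column
summary(toy_wide)   # per-column summaries (min, median, mean, max for numeric columns)
names(toy_wide)     # column names as a character vector
```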
Comment on whether this dataset is tidy, and if not, list the reasons why.
Hint: In a tidy data frame, every column represents a single variable and every row represents a single observation.
Answer
[YOUR ANSWER HERE]
Question 3
How will you reshape the data frame so that each row has only one experimental observation?
Before we reshape, let’s think about what we want:
- Which column should stay the same? (The transcript ID)
- Which columns contain the measurements? (All the others)
- What should we call the new column names?
Use pivot_longer() to reshape the data. You’ll want to:
- Keep the ensembl_transcript_id column as-is (cols)
- Create a new column for the condition names (names_to)
- Create a new column for the values (values_to)
# Reshape the data so each row is one observation
exp_tbl_long <- pivot_longer(
  x,
  cols = ___,
  names_to = ___,
  values_to = ___
)
exp_tbl_long
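The same reshaping pattern can be sketched on a made-up table. Everything here (`toy_wide`, the `id` column, and the new column names `condition` and `abundance`) is hypothetical and chosen for illustration only.

```r
library(tidyr)

# Hypothetical wide table with one identifier column and two measurement columns.
toy_wide <- tibble::tibble(
  id = c("tx1", "tx2"),
  rna_0h_rep1 = c(1.5, 2.0),
  rna_0h_rep2 = c(1.7, 2.2)
)

# cols = -id pivots every column except id;
# the old column names go to "condition" and the cell values to "abundance".
toy_long <- pivot_longer(
  toy_wide,
  cols = -id,
  names_to = "condition",
  values_to = "abundance"
)
toy_long
```

Two rows with two measurement columns each become four rows, one per measurement.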
Question 4
How will you modify the dataframe so that multiple variables are not present in a single column?
Use separate_wider_delim() to split the condition column into separate variables. You need to:
- Specify which column to separate (cols)
- Specify the delimiter character (delim)
- Provide the new column names (names)
exp_tbl_tidy <- separate_wider_delim(
  exp_tbl_long,
  cols = ___,
  delim = ___,
  names = ___
)
exp_tbl_tidy
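A minimal sketch of the splitting step on made-up data follows. The table `toy_long` and the new column names are assumptions for illustration; separate_wider_delim() requires tidyr 1.3.0 or later.

```r
library(tidyr)

# Hypothetical long table where one column packs three variables together.
toy_long <- tibble::tibble(
  id = c("tx1", "tx1"),
  condition = c("rna_0h_rep1", "rna_0h_rep2"),
  abundance = c(1.5, 1.7)
)

# Split "condition" at each underscore into three new columns,
# which replace the original column in place.
toy_tidy <- separate_wider_delim(
  toy_long,
  cols = condition,
  delim = "_",
  names = c("molecule", "time", "replicate")
)
toy_tidy
```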
Question 5
How will you save your output as a TSV file?
Use write_tsv() from the readr package to save your tidy data. Provide the data object and a filename.
Hint: Use the readr cheatsheet at the bottom of this page to figure this out.
After running your new code, you should have a new file called transcripts.tidy.tsv in your working directory.
write_tsv(exp_tbl_tidy, "transcripts.tidy.tsv")
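One way to check that a TSV was written as expected is to read it back. The sketch below does this with a made-up table and a temporary file; `toy_tidy` and `toy.tidy.tsv` are hypothetical names.

```r
library(readr)

# Hypothetical tidy table to write out.
toy_tidy <- tibble::tibble(
  id = c("tx1", "tx2"),
  time = c("0h", "14h"),
  abundance = c(1.5, 3.1)
)

# Write to a temporary location so the working directory stays untouched.
out_path <- file.path(tempdir(), "toy.tidy.tsv")
write_tsv(toy_tidy, out_path)

# The header line is tab-separated, and the file reads back with read_tsv().
first_line <- readLines(out_path, n = 1)
toy_back <- read_tsv(out_path, show_col_types = FALSE)
```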
Question 6
Can you reverse the process? How would you go from tidy back to wide format?
Use pivot_wider() to go from the tidy format back to the original wide format. You need to:
- Specify where the new column names come from (names_from)
- Specify where the values come from (values_from)
- Specify how to combine the names (names_sep)
pivot_wider(
  exp_tbl_tidy,
  names_from = ___,
  values_from = ___,
  names_sep = ___
)
After this, your new data should look like the original tibble you started with.
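The reverse step can be sketched on a made-up tidy table. The object `toy_tidy` and all of its column names and values are hypothetical.

```r
library(tidyr)

# Hypothetical tidy table: one row per transcript and replicate.
toy_tidy <- tibble::tibble(
  id = c("tx1", "tx1", "tx2", "tx2"),
  molecule = "rna",
  time = "0h",
  replicate = c("rep1", "rep2", "rep1", "rep2"),
  abundance = c(1.5, 1.7, 2.0, 2.2)
)

# The values of the three names_from columns are joined with names_sep
# to form each new column name, e.g. "rna_0h_rep1".
toy_wide <- pivot_wider(
  toy_tidy,
  names_from = c(molecule, time, replicate),
  values_from = abundance,
  names_sep = "_"
)
toy_wide
```

Passing several columns to names_from is what lets names_sep rebuild the combined column names in a single call.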