library(tidyverse)
library(here)
exp_tbl <- read_csv(
  here(___)
)
R Bootcamp Problem Set 4
Problem Set
Use the data files in the data/ directory to answer the questions.
For this problem set, you are allowed to help each other, but you are not allowed to post correct answers in Slack.
The problem set is due at 5pm on Aug 29.
Grading rubric
- Everything is good: full points
- Partially correct answer: depends on how many steps are correct
- Reasonable attempt: half points
Question 1 5 points
- Load the tidyverse and here packages using library().
- Import the dataset data/data_rna_protein.csv.gz using read_csv() and here().
data_rna_protein.csv.gz: This is a combined dataset from an RNAseq and SILAC proteomics experiment, where a transcription factor (TF) was differentially expressed and the fold change in RNA and protein was calculated between TF-expressing and non-expressing cells.
Question 2 5 points
Let’s build a data processing workflow step by step. This teaches you how to build complex pipelines gradually - a key skill in data analysis.
Step 1: First, explore the data so you know what you’re working with. Use glimpse() to see column types and summary() to see distributions:
# Always explore your data first!
Step 2: Select only the columns we need:
- geneid (gene identifier)
- iDUX4_logFC (RNA fold change)
- iDUX4_fdr (RNA p-value)
- hl.ratio (protein fold change)
- pval (protein p-value)
Use select() and list the columns you want to keep:
exp_tbl |>
  select(___)
Step 3: Rename columns for clarity (this makes your code more readable).
Use dplyr::rename() with the pattern new_name = old_name, ...:
exp_tbl |>
  select(___) |>
  rename(
    ___ = ___,
    # etc
  )
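As a rough illustration of the select() then rename() pattern, here is a minimal sketch on a made-up table. The table toy_tbl and its column names (id, score.raw, notes) are hypothetical and are not part of the problem set data.

```r
library(dplyr)

# Hypothetical toy table; the column names are illustrative only
toy_tbl <- tibble(
  id = c("a", "b", "c"),
  score.raw = c(1.2, 3.4, 5.6),
  notes = c("x", "y", "z")
)

toy_renamed <- toy_tbl |>
  select(id, score.raw) |>   # keep only the columns of interest
  rename(score = score.raw)  # pattern: new_name = old_name

toy_renamed
```

Renaming right after selecting keeps the later steps of a pipeline free of awkward names such as those containing a dot.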
Step 4: Clean the data by removing rows with missing values and duplicate rows. Use drop_na() to remove rows with any missing values, and distinct() to remove duplicate rows:
exp_tbl |>
  select(___) |>
  rename(
    ___ = ___,
    # etc
  ) |>
  ___() |> # Remove rows with any missing values
  ___() # Remove duplicate rows
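To see what these two cleaning verbs do, the following sketch applies them to a small made-up table. The table toy_tbl and its columns are hypothetical; drop_na() comes from tidyr and distinct() from dplyr.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy table with one missing value and one duplicated row
toy_tbl <- tibble(
  id = c("a", "a", "b", "c"),
  value = c(1, 1, NA, 3)
)

toy_clean <- toy_tbl |>
  drop_na() |>  # drops the row for "b", which has a missing value
  distinct()    # collapses the two identical "a" rows into one

toy_clean
```

Note that distinct() with no arguments only removes rows that are identical in every column.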
Step 5: Finally, arrange the data and save it. Use arrange() to sort by RNA fold change (high to low), then protein fold change (low to high):
exp_tbl_subset <- exp_tbl |>
  select(___) |>
  rename(
    ___ = ___,
    # etc
  ) |>
  ___() |> # Remove rows with any missing values
  ___() |> # Remove duplicate rows
  # Sort by RNA fold change (high to low), then protein fold change (low to high)
  ___(___, ___)

exp_tbl_subset
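Sorting on two keys in opposite directions can be sketched on made-up data as follows. The table toy_tbl and its columns x and y are hypothetical; desc() is the dplyr helper for descending order.

```r
library(dplyr)

# Hypothetical toy table with a tie in x
toy_tbl <- tibble(
  id = c("a", "b", "c", "d"),
  x = c(2, 5, 5, 1),
  y = c(0.3, 0.9, 0.1, 0.7)
)

toy_sorted <- toy_tbl |>
  arrange(desc(x), y)  # x high to low; ties in x are broken by y low to high

toy_sorted
```

The second sort key only matters for rows that tie on the first one.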
Question 3 5 points
Let’s practice good data analysis habits by checking for potential issues. Quality control is essential in real data analysis.
Check for duplicates and missing values:
- Use count() to check for duplicate genes
- Use summarize() with across() to count missing values in all columns
- Use summary statistics to understand data distributions
# Check for duplicate genes (distinct() removed identical rows, but a gene
# could still appear more than once with different values)
exp_tbl_subset |>
  count(___) |>
  ___(n > 1) # Any genes appearing more than once?
# Summary of missing values by column
exp_tbl_subset |>
  summarize(
    # first blank selects variables
    # second blank applies a function to count NA values
    across(___, ___)
  )
# Look at the distribution of our main variables
exp_tbl_subset |>
  summarize(
    across(
      # specify the variables to summarize
      ___,
      list(
        # mean
        mean = ~ mean(., na.rm = TRUE),
        # now do median
        ___ = ~ ___(., na.rm = TRUE),
        # and sd
        ___ = ~ ___(., na.rm = TRUE)
      )
    ),
    .groups = "drop"
  )
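The across() pattern used in these checks can be tried out on a small made-up table first. This is only a sketch: toy_tbl and its columns x and y are hypothetical, and the lambdas use .x, which is equivalent to the . used in the template above.

```r
library(dplyr)

# Hypothetical toy table with missing values in both columns
toy_tbl <- tibble(
  x = c(1, NA, 3, 4),
  y = c(NA, NA, 2, 6)
)

# One function applied to every column: count the missing values
na_counts <- toy_tbl |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# A named list of functions: output columns are named {column}_{function}
toy_stats <- toy_tbl |>
  summarize(
    across(
      c(x, y),
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        sd = ~ sd(.x, na.rm = TRUE)
      )
    )
  )

na_counts
toy_stats
```

Passing a named list to across() yields one output column per combination of input column and function, which keeps wide summaries readable.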
Question 4 5 points
How well do the overall rna_FC and protein_FC values correlate in this experiment? We’ll explore this with visualization and statistics.
Step 1: Create a scatter plot of rna_FC vs protein_FC using ggplot(). Use:
- aes() to map x and y variables
- geom_point() to create the scatter plot
- labs() to add informative axis labels and title
ggplot(
  ___,
  aes(
    x = ___,
    y = ___
  )
) +
  # add points
  ___() +
  # add labels
  labs(
    x = "___",
    y = "___",
    title = "___"
  )
Step 2: Add reference lines to help interpret the correlation. Use:
- geom_abline(slope = 1, intercept = 0) for the y = x reference line (perfect 1:1 agreement)
- geom_smooth(method = "lm", se = FALSE) for the computed trend line
- alpha = 0.6 in geom_point(), making points slightly transparent for better visualization
ggplot(
  ___,
  aes(
    x = ___,
    y = ___
  )
) +
  # Add transparent points (change the ???)
  geom_???(alpha = 0.6) +
  # Add the perfect correlation line (change the ???)
  geom_???(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  # Add the computed trend line (change the ???)
  geom_???(method = "lm", se = FALSE, color = "blue", linewidth = 1) +
  labs(
    x = "___",
    y = "___",
    title = "___"
  )
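The way ggplot2 layers stack with + can be sketched on made-up data. The data frame toy_df, its columns, and the labels below are hypothetical and unrelated to the problem set data; each + adds one layer on top of the previous ones.

```r
library(ggplot2)

# Hypothetical toy data: y is roughly half of x
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(0.4, 1.2, 1.4, 2.1, 2.4, 3.2)
)

p <- ggplot(toy_df, aes(x = x, y = y)) +
  geom_point(alpha = 0.6) +                              # semi-transparent points
  geom_abline(slope = 1, intercept = 0, color = "red") + # y = x reference line
  geom_smooth(method = "lm", se = FALSE, color = "blue") + # fitted linear trend
  labs(x = "x (toy)", y = "y (toy)", title = "Toy layered scatter plot")

p
```

In this toy case the fitted blue line is flatter than the red y = x line, which is the kind of visual comparison the reference line is meant to support.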
Step 3: Calculate the correlation coefficient using cor(). Use Spearman correlation since it’s robust to outliers. Use ?cor to see the function documentation. You will need to specify two vectors for the calculation, and it’s easiest to provide them using the $ operator to extract columns from the data frame.
rna_prot_cor <- cor(
  # specify the first vector
  ___,
  # specify the second vector
  ___,
  method = "spearman"
)

rna_prot_cor
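To illustrate why a rank-based correlation is less sensitive to outliers, here is a small sketch on made-up numbers. The data frame toy_df is hypothetical: y rises steadily with x, except that the last value is extreme.

```r
# Hypothetical data: y increases with x, with one extreme value at the end
toy_df <- data.frame(
  x = c(1, 2, 3, 4, 5, 6),
  y = c(1.1, 1.9, 3.2, 3.8, 5.1, 60)
)

# Pearson uses the raw values, so the extreme point pulls it down
pearson_r <- cor(toy_df$x, toy_df$y, method = "pearson")

# Spearman uses ranks only, so a strictly increasing relationship gives 1
spearman_rho <- cor(toy_df$x, toy_df$y, method = "spearman")

pearson_r
spearman_rho
```

Because the ranks of x and y agree exactly here, the Spearman coefficient is 1 even though the Pearson coefficient is noticeably lower.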
Answer
[ YOUR ANSWER HERE ]
Submit
Be sure to click the “Render” button to render the HTML output.
Then paste the URL of this Posit Cloud project into the problem set on Canvas.