Use the data files in the data/ directory to answer the questions.
For this problem set, you may help each other, but you may not post correct answers in Slack.
The problem set is due at 12 pm (noon) on Sept 1.
Grading rubric
Everything is good: full points
Partially correct answer: depends on how many steps are correct
Reasonable attempt: half points
Question 1 (5 points)
Load the tidyverse and here packages.
Import the dataset data/data_rna_protein.csv.gz.
data_rna_protein.csv.gz: This is a combined dataset from an RNAseq and a SILAC proteomics experiment in which a transcription factor (TF) was differentially expressed, and the fold change in RNA and protein was calculated between TF-expressing and non-expressing cells.
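One possible way to load the packages and import the file is sketched below. The object names rna_protein_path and data_rna_protein are hypothetical, and the file.exists() guard is only there so the sketch does not error when it is run outside the project.

```r
library(tidyverse)
library(here)

# Build the path relative to the project root
# (assumes the file sits in the data/ directory).
rna_protein_path <- here("data", "data_rna_protein.csv.gz")

# read_csv() reads gzip-compressed CSV files directly.
# The guard skips the import if the file is not where we assumed.
if (file.exists(rna_protein_path)) {
  data_rna_protein <- read_csv(rna_protein_path)
}
```

Using here() rather than a hard-coded relative path keeps the import working regardless of which directory the document is rendered from.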
-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr 1.1.3 v readr 2.1.4
v forcats 1.0.0 v stringr 1.5.0
v ggplot2 3.4.3 v tibble 3.2.1
v lubridate 1.9.2 v tidyr 1.3.0
v purrr 1.0.2
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag() masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Rows: 21282 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr (1): geneid
dbl (16): iDUX4_logFC, iDUX4_logCPM, iDUX4_LR, iDUX4_pval, iDUX4_fdr, hl.rat...
i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Question 2 (5 points)
Using the imported data set, carry out the following:
Inspect the data so you know what you are dealing with (summary() etc).
Select only the following columns: geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, and pval.
Rename them as follows: rna_FC = iDUX4_logFC, rna_pval = iDUX4_fdr, protein_FC = hl.ratio, protein_pval = pval (hint: use dplyr::rename())
Drop all rows with NA values in them (hint: use a function from tidyr)
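The steps above could be chained roughly as follows. This is a minimal sketch on a small made-up table (toy_raw) that borrows a few of the real column names; the values are invented for illustration, and toy_raw and toy_clean are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the imported table (made-up values, not the real data).
toy_raw <- tibble::tibble(
  geneid       = c("geneA", "geneB", "geneC", "geneD"),
  iDUX4_logFC  = c(2.1, -0.4, 0.0, 1.3),
  iDUX4_logCPM = c(5.0, 3.2, 4.1, 6.3),
  iDUX4_fdr    = c(0.001, 0.60, 0.95, 0.02),
  hl.ratio     = c(1.5, NA, 0.1, 0.8),
  pval         = c(0.01, NA, 0.70, 0.04)
)

# Inspect the data: ranges, quartiles, and NA counts per column.
summary(toy_raw)

toy_clean <- toy_raw |>
  # Keep only the columns of interest.
  select(geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, pval) |>
  # rename() takes new_name = old_name pairs.
  rename(
    rna_FC       = iDUX4_logFC,
    rna_pval     = iDUX4_fdr,
    protein_FC   = hl.ratio,
    protein_pval = pval
  ) |>
  # Drop every row that has an NA in any remaining column.
  drop_na()
```

In this toy table, geneB has missing protein values, so drop_na() removes that row and leaves the other three.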
How well do the overall rna_FC and protein_FC values correlate in this experiment?
Using the output from the above question, do the following:
Create a scatter plot of rna_FC vs protein_FC. Observe how the points scatter.
Add a line to the plot that would indicate perfect 1:1 correlation. Hint: use geom_abline() with its slope argument.
Add a linear model fit using geom_smooth() (method = 'lm'). Observe how the x=y line deviates from your geom_smooth line.
Calculate the Spearman correlation coefficient. (Hint: use the base R function cor(). Use help() or Google to learn more, including how to set method to 'spearman'.)
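A rough sketch of the plot and the correlation is shown below, using a small made-up data frame (toy_fc) in place of the cleaned data. The values and the names toy_fc, fc_plot, and spearman_rho are assumptions for illustration only.

```r
library(ggplot2)

# Toy fold-change values (made up for illustration, not the real data).
toy_fc <- data.frame(
  rna_FC     = c(-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0, 3.0),
  protein_FC = c(-0.3, -0.6,  0.1, 0.0, 0.4, 0.2, 0.9, 0.7)
)

fc_plot <- ggplot(toy_fc, aes(x = rna_FC, y = protein_FC)) +
  geom_point(alpha = 0.5) +
  # 1:1 reference line: slope 1, intercept 0.
  geom_abline(slope = 1, intercept = 0, colour = "green") +
  # Linear model fit; geom_smooth() draws its line in blue by default.
  geom_smooth(method = "lm", formula = y ~ x)

# Spearman correlation is computed on the ranks of the two variables.
spearman_rho <- cor(toy_fc$rna_FC, toy_fc$protein_FC, method = "spearman")
```

Printing fc_plot draws the figure. If the real data still contained NA values at this point, cor() would return NA unless its use argument were set, for example use = "complete.obs".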
Using all of the information from above, comment on the correlation between rna_FC and protein_FC below.
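The green line is the 1:1 line that would indicate perfect agreement between rna_FC and protein_FC, and the blue line is the linear model fit of the data. The Spearman correlation is 0.346, indicating a weak-to-moderate positive correlation: protein fold changes tend to increase with RNA fold changes, but only loosely. Note that the correlation coefficient describes how consistently the two values rank together; it is not a slope or a ratio, so it does not mean that 0.346 proteins are made per mRNA.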
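When interpreting the coefficient, it may help to see that a correlation and a slope measure different things. The sketch below uses made-up numbers (x, y) in which y is exactly 0.1 times x: the ranks agree perfectly, so the Spearman correlation is 1, while the fitted slope is 0.1.

```r
# Made-up values: y is exactly 0.1 * x, a perfectly monotonic relationship.
x <- c(1, 2, 3, 4, 5, 6)
y <- 0.1 * x

# Rank agreement is perfect, so the Spearman correlation is 1.
rho <- cor(x, y, method = "spearman")

# The slope of the linear fit is 0.1: it describes how much y changes
# per unit of x, which the correlation coefficient does not.
slope <- unname(coef(lm(y ~ x))["x"])
```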
Submit
Be sure to click the “Render” button to render the HTML output.
Then paste the URL of this Posit Cloud project into the problem set on Canvas.