R Bootcamp Problem Set 4

Author

Your name here

Published

October 21, 2024

Problem Set

Use the data files in the data/ directory to answer the questions.

For this problem set, you are allowed to help each other, but you are not allowed to post correct answers in slack.

The problem set is due 12pm on Sept 1.

Grading rubric

Everything is good: full points
Partially correct answer: depends on how many steps are correct
Reasonable attempt: half points

Question 1 5 points

Load the tidyverse and here packages.
Import datasets: data/data_rna_protein.csv.gz.

data_rna_protein.csv.gz: This is a combined dataset from an RNAseq and SILAC proteomics experiment, where a transcription factor (TF) was differentially expressed and the fold change in RNA and protein calculated between TF-expressing and non-expressing cells.

library(tidyverse)

-- Attaching core tidyverse packages ------------------------ tidyverse 2.0.0 --
v dplyr     1.1.3     v readr     2.1.4
v forcats   1.0.0     v stringr   1.5.0
v ggplot2   3.4.3     v tibble    3.2.1
v lubridate 1.9.2     v tidyr     1.3.0
v purrr     1.0.2     
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()
i Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(here)

here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950

exp_tbl <- read_csv(
  here("problem-sets/data/data_rna_protein.csv.gz")
)

Rows: 21282 Columns: 17
-- Column specification --------------------------------------------------------
Delimiter: ","
chr  (1): geneid
dbl (16): iDUX4_logFC, iDUX4_logCPM, iDUX4_LR, iDUX4_pval, iDUX4_fdr, hl.rat...

i Use `spec()` to retrieve the full column specification for this data.
i Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 2 5 points

Using the imported data set, carry out the following:

Inspect the data so you know what you are dealing with (summary() etc).
Select only the following columns: geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, and pval.
Rename them as follows: rna_FC = iDUX4_logFC, rna_pval = iDUX4_fdr, protein_FC = hl.ratio, protein_pval = pval (hint: use dplyr::rename())
Drop all rows with NA values in them (hint: use a function from tidyr)
Remove duplicate rows (hint: use dplyr::distinct()).
Arrange the table by descending rna_FC and ascending protein_FC.
Conduct steps 2-7 by piping the output of one step to another (i.e, a single workflow & remember to comment).
Save the output of this workflow into a new object.

exp_tbl_subset <- exp_tbl |>
  select(geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, pval) |>
  rename(
    rna_FC = iDUX4_logFC,
    rna_pval = iDUX4_fdr,
    protein_FC = hl.ratio,
    protein_pval = pval
  ) |>
  drop_na() |>
  distinct() |>
  arrange(desc(rna_FC), protein_FC)

Question 3 5 points

How well do the overall rna_FC and protein_FC values correlate in this experiment?

Using the output from the above question, do the following:

Create a scatter plot of rna_FC vs protein_FC. observe how the points scatter.
Add a line to the plot that would indicate perfect 1:1 correlation. Hint: use geom_abline() with its slope argument.
Add a linear model fit using geom_smooth() (method = 'lm'). Observe how the x=y line deviates from your geom_smooth line.
Calculate the Spearman correlation coefficient. (Hint: This uses a base R math function called cor - Use help() or Google to learn more and how to specify method as spearman)
Using all of the information from above, comment on the correlation between rna_FC and protein_FC below.

ggplot(exp_tbl_subset, aes(rna_FC, protein_FC)) +
  geom_point() +
  geom_abline(linewidth = 1, color = "green") +
  geom_smooth(method = "lm", se = FALSE, linewidth = 1)

`geom_smooth()` using formula = 'y ~ x'

rna_prot_cor <- cor(
  exp_tbl_subset$rna_FC,
  exp_tbl_subset$protein_FC,
  method = "spearman"
)

Answer

The green line indicates a perfect correlation, and the blue line is the linear model fit of the data. The Spearman correlation is 0.346, indicating a strong positive correlation. One way to think about this is that there are 0.346 proteins made per mRNA.

Submit

Be sure to click the “Render” button to render the HTML output.

Then paste the URL of this Posit Cloud project into the problem set on Canvas.