R Bootcamp Problem Set 4

Author

Your name here

Published

September 6, 2025

Problem Set

Use the data files in the data/ directory to answer the questions.

For this problem set, you may help each other, but do not post correct answers in Slack.

The problem set is due at 5 pm on Aug 29.

Grading rubric

  • Everything is good: full points
  • Partially correct answer: depends on how many steps are correct
  • Reasonable attempt: half points

Question 1 (5 points)

  1. Load the tidyverse and here packages using library().
  2. Import the dataset data/data_rna_protein.csv.gz using read_csv() and here().

data_rna_protein.csv.gz: A combined dataset from an RNA-seq and SILAC proteomics experiment in which a transcription factor (TF) was differentially expressed and the fold changes in RNA and protein were calculated between TF-expressing and non-expressing cells.

library(tidyverse)
library(here)

exp_tbl <- read_csv(
  here(___)
)
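
As a minimal sketch of how compressed CSV files are read, the example below writes a small made-up table to a temporary .csv.gz file and reads it back. The data and the names toy, toy_path, and toy_tbl are hypothetical and are not part of the problem set.

```r
library(readr)

# Hypothetical toy data, written to a temporary gzipped CSV
toy <- data.frame(gene = c("A", "B", "C"), fc = c(1.5, -0.2, 0.8))
toy_path <- tempfile(fileext = ".csv.gz")
write_csv(toy, toy_path) # readr compresses because the path ends in .gz

# read_csv() decompresses .gz files automatically
toy_tbl <- read_csv(toy_path, show_col_types = FALSE)
toy_tbl
```

Inside a project, here() builds a path relative to the project root, so the same code keeps working regardless of the current working directory.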

Question 2 (5 points)

Let’s build a data processing workflow step by step. Building complex pipelines gradually is a key skill in data analysis.

Step 1: First, explore the data so you know what you’re working with. Use glimpse() to see column types and summary() to see distributions:

# Always explore your data first!
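
A minimal sketch of what the two exploration functions report, using a made-up tibble (toy and its columns are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with one missing value
toy <- tibble(gene = c("A", "B", "C"), fc = c(1.5, NA, 0.8))

glimpse(toy) # one line per column: name, type, first values
toy_summary <- summary(toy) # per-column quantiles, mean, and NA count
toy_summary
```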

Step 2: Select only the columns we need:

  • geneid (gene identifier)
  • iDUX4_logFC (RNA fold change)
  • iDUX4_fdr (RNA false discovery rate, i.e. adjusted p-value)
  • hl.ratio (protein fold change)
  • pval (protein p-value)

Use select() and list the columns you want to keep:

exp_tbl |>
  select(___)
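
For illustration only, here is select() on a made-up tibble. The column names id, logFC, fdr, and note are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with one column we do not need
toy <- tibble(
  id = c("g1", "g2"),
  logFC = c(2.1, -0.4),
  fdr = c(0.01, 0.6),
  note = c("x", "y")
)

# Keep only the listed columns, in the order given
toy_sel <- toy |>
  select(id, logFC, fdr)

toy_sel
```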

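Step 3: Rename columns for clarity (this makes your code more readable). Question 4 refers to the fold-change columns as rna_FC and protein_FC, so use those names for them.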

Use dplyr::rename() with the pattern new_name = old_name, ...:

exp_tbl |>
  select(___) |>
  rename(
    ___ = ___,
    # etc
  )
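
A small sketch of the new_name = old_name pattern on made-up data (all names here are hypothetical):

```r
library(dplyr)

# Hypothetical toy data with terse column names
toy <- tibble(id = c("g1", "g2"), logFC = c(2.1, -0.4))

# The new name goes on the left, the existing name on the right
toy_renamed <- toy |>
  rename(
    gene = id,
    rna_change = logFC
  )

toy_renamed
```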

Step 4: Clean the data by removing rows with missing values and duplicate rows. Use drop_na() to remove rows with any missing values, and distinct() to remove duplicate rows:

exp_tbl |>
  select(___) |>
  rename(
    ___ = ___,
    # etc
  ) |>
  ___() |> # Remove rows with any missing values
  ___() # Remove duplicate rows
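
To see what each cleaning function does, here is a sketch on a made-up tibble that contains one duplicated row and one row with a missing value (toy is hypothetical):

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: rows 1 and 2 are identical, row 3 has an NA
toy <- tibble(
  gene = c("g1", "g1", "g2", "g3"),
  value = c(1, 1, NA, 3)
)

toy_clean <- toy |>
  drop_na() |> # drops the g2 row
  distinct() # collapses the two identical g1 rows into one

toy_clean
```

Note that distinct() with no arguments only removes rows that are identical in every column.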

Step 5: Finally, arrange the data and save it. Use arrange() to sort by RNA fold change (high to low), then protein fold change (low to high):

exp_tbl_subset <- exp_tbl |>
  select(___) |>
  rename(
    ___ = ___,
    # etc
  ) |>
  ___() |> # Remove rows with any missing values
  ___() |> # Remove duplicate rows
  # Sort by RNA fold change (high to low), then protein fold change (low to high)
  ___(___, ___)

exp_tbl_subset
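
As a sketch of sorting on two columns in opposite directions, the made-up example below sorts column a from high to low and breaks ties with column b from low to high. The names toy, a, and b are hypothetical.

```r
library(dplyr)

# Hypothetical toy data with a tie in column a
toy <- tibble(
  gene = c("g1", "g2", "g3", "g4"),
  a = c(1, 2, 2, 0),
  b = c(5, 9, 3, 1)
)

# desc() reverses the default ascending order for that column only
toy_sorted <- toy |>
  arrange(desc(a), b)

toy_sorted
```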

Question 3 (5 points)

Let’s practice good data analysis habits by checking for potential issues. Quality control is essential in real data analysis.

Check for duplicates and missing values:

  1. Use count() to check for duplicate genes
  2. Use summarize() with across() to count missing values in all columns
  3. Use summary statistics to understand data distributions

# Check for duplicate genes (distinct() only removes fully identical rows,
# so a gene could still appear more than once)
exp_tbl_subset |>
  count(___) |>
  ___(n > 1) # Any genes appearing more than once?

# Summary of missing values by column
exp_tbl_subset |>
  summarize(
    # first blank: select the variables
    # second blank: a function that counts NA values
    across(___, ___)
  )

# Look at the distribution of our main variables
exp_tbl_subset |>
  summarize(
    across(
      # specify the variables to summarize
      ___,
      list(
        # mean
        mean = ~ mean(., na.rm = TRUE),
        # now do median
        ___ = ~ ___(., na.rm = TRUE),
        # and sd
        ___ = ~ ___(., na.rm = TRUE)
      )
    ),
    .groups = "drop"
  )
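
The three checks above can be sketched on made-up data as follows. The tibble toy and its columns are hypothetical, and max is used here as the second statistic only to illustrate the named-list pattern.

```r
library(dplyr)

# Hypothetical toy data with a repeated gene and some missing values
toy <- tibble(
  gene = c("g1", "g1", "g2"),
  x = c(1, NA, 3),
  y = c(NA, NA, 2)
)

# Genes that appear more than once
dup_genes <- toy |>
  count(gene) |>
  filter(n > 1)

# Number of missing values in every column
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

# Several statistics per column; result columns are named <column>_<statistic>
toy_stats <- toy |>
  summarize(
    across(
      c(x, y),
      list(
        mean = ~ mean(.x, na.rm = TRUE),
        max = ~ max(.x, na.rm = TRUE)
      )
    )
  )

toy_stats
```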

Question 4 (5 points)

How well do the overall rna_FC and protein_FC values correlate in this experiment? We’ll explore this with visualization and statistics.

Step 1: Create a scatter plot of rna_FC vs protein_FC using ggplot(). Use:

  • aes() to map x and y variables
  • geom_point() to create the scatter plot
  • labs() to add informative axis labels and title

ggplot(
  ___,
  aes(
    x = ___,
    y = ___
  )
) +
  # add points
  ___() +
  # add labels
  labs(
    x = "___",
    y = "___",
    title = "___"
  )

Step 2: Add reference lines to help interpret the correlation. Use:

  • geom_abline(slope = 1, intercept = 0) for the identity line (y = x), where the two fold changes are equal
  • geom_smooth(method = "lm", se = FALSE) for the computed trend line
  • set alpha = 0.6 inside geom_point() to make the points slightly transparent, which helps when points overlap

ggplot(
  ___,
  aes(
    x = ___,
    y = ___
  )
) +
  # Add transparent points (change the ???)
  geom_???(alpha = 0.6) +
  # Add the identity line y = x (change the ???)
  geom_???(slope = 1, intercept = 0, color = "red", linewidth = 1) +
  # Add the computed trend line (change the ???)
  geom_???(method = "lm", se = FALSE, color = "blue", linewidth = 1) +
  labs(
    x = "___",
    y = "___",
    title = "___"
  )
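
For reference, here is a sketch of how these layers combine, using simulated data rather than the experiment. The data frame toy, its columns x and y, and the labels are hypothetical.

```r
library(ggplot2)

# Hypothetical simulated data: y follows x with slope 0.5 plus noise
set.seed(1)
toy <- data.frame(x = rnorm(50))
toy$y <- 0.5 * toy$x + rnorm(50, sd = 0.5)

p <- ggplot(toy, aes(x = x, y = y)) +
  # alpha outside aes() applies one fixed transparency to all points
  geom_point(alpha = 0.6) +
  # identity line y = x
  geom_abline(slope = 1, intercept = 0, color = "red") +
  # least-squares trend line without a confidence band
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
  labs(x = "Simulated x", y = "Simulated y", title = "Toy scatter plot")

p
```

When the trend line is flatter than the identity line, as in this simulation, y changes less than x on average.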

Step 3: Calculate the correlation coefficient using cor(). Use Spearman correlation, which is rank-based and therefore less sensitive to outliers than the default Pearson method. See ?cor for the function documentation. cor() needs two vectors; the easiest way to supply them is to extract columns from the data frame with the $ operator.

rna_prot_cor <- cor(
  # specify the first vector
  ___,
  # specify the second vector
  ___,
  method = "spearman"
)

rna_prot_cor
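
To illustrate the difference between the two methods, the sketch below uses a made-up data frame in which y increases with x but not linearly. The names toy, pearson, and spearman are hypothetical.

```r
# Hypothetical toy data: y rises with x, but not along a straight line
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- toy$x^3

# Pearson (the default) measures linear association
pearson <- cor(toy$x, toy$y)

# Spearman compares ranks, so any strictly increasing relationship gives 1
spearman <- cor(toy$x, toy$y, method = "spearman")

c(pearson = pearson, spearman = spearman)
```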

Answer

[ YOUR ANSWER HERE ]

Submit

Be sure to click the “Render” button to render the HTML output.

Then paste the URL of this Posit Cloud project into the problem set on Canvas.