R Bootcamp Problem Set 4

Author

Your name here

Published

September 6, 2025

Problem Set

Use the data files in the data/ directory to answer the questions.

For this problem set, you are allowed to help each other, but you are not allowed to post correct answers in slack.

The problem set is due 12pm on Sept 1.

Grading rubric

  • Everything is good: full points
  • Partially correct answer: depends on how many steps are correct
  • Reasonable attempt: half points

Question 1 5 points

  1. Load the tidyverse and here packages using library().
  2. Import datasets: data/data_rna_protein.csv.gz using read_csv() and here().

data_rna_protein.csv.gz: This is a combined dataset from an RNAseq and SILAC proteomics experiment, where a transcription factor (TF) was differentially expressed and the fold change in RNA and protein calculated between TF-expressing and non-expressing cells.

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.2     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.4     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
here() starts at /Users/jayhesselberth/devel/rnabioco/molb-7950
exp_tbl <- read_csv(
  here("data/bootcamp/data_rna_protein.csv.gz")
)
Rows: 21282 Columns: 17
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (1): geneid
dbl (16): iDUX4_logFC, iDUX4_logCPM, iDUX4_LR, iDUX4_pval, iDUX4_fdr, hl.rat...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Question 2 5 points

Let’s build a data processing workflow step by step. This teaches you how to build complex pipelines gradually - a key skill in data analysis.

Step 1: First, explore the data so you know what you’re working with. Use glimpse() to see column types and summary() to see distributions:

# Always explore your data first!
exp_tbl |> glimpse()
Rows: 21,282
Columns: 17
$ geneid         <chr> "RFPL1", "DUXA", "RFPL2", "LEUTX", "RFPL3S", "ZSCAN5C",…
$ iDUX4_logFC    <dbl> 9.366333, 8.728522, 8.582827, 8.136148, 8.031894, 7.837…
$ iDUX4_logCPM   <dbl> 8.568344, 9.740241, 9.760915, 8.702694, 5.563714, 6.532…
$ iDUX4_LR       <dbl> 2910.21184, 5195.45733, 4397.04659, 3276.44418, 392.901…
$ iDUX4_pval     <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 1.93e-87, 5.12e…
$ iDUX4_fdr      <dbl> 0.00e+00, 0.00e+00, 0.00e+00, 0.00e+00, 8.64e-86, 5.76e…
$ hl.ratio       <dbl> 4.415276, 8.919536, 3.032258, 7.151629, NA, 3.910093, N…
$ area           <dbl> 222742.2, 3523464.3, 119468.9, 1579141.5, NA, 307343.6,…
$ protein.length <dbl> 317, 204, 378, 168, NA, 496, NA, NA, 465, NA, 474, NA, …
$ rep            <dbl> 3.760000, 3.500000, 3.888889, 3.333333, NA, 4.000000, N…
$ IQR            <dbl> 1.952243, 6.435921, 1.694725, 5.303046, NA, 0.000000, N…
$ QCoD           <dbl> 0.4421564, 0.7215533, 0.5588985, 0.7415158, NA, 0.00000…
$ count          <dbl> 25, 32, 9, 6, NA, 1, NA, NA, 21, NA, 10, NA, 1, 80, NA,…
$ mean           <dbl> 0.004465732, -0.003224870, 0.024341945, -0.009047206, N…
$ sd             <dbl> 0.17197783, 0.13994973, 0.29149227, 0.38073424, NA, 2.0…
$ zscore         <dbl> 25.647554, 63.756902, 10.319026, 18.807544, NA, 1.89808…
$ pval           <dbl> 4.500000e-145, 0.000000e+00, 5.780000e-25, 6.550000e-79…
exp_tbl |> summary()
    geneid           iDUX4_logFC       iDUX4_logCPM       iDUX4_LR        
 Length:21282       Min.   :-3.4751   Min.   : 1.264   Min.   :    0.000  
 Class :character   1st Qu.:-0.9996   1st Qu.: 3.385   1st Qu.:    2.068  
 Mode  :character   Median :-0.3772   Median : 4.780   Median :    9.049  
                    Mean   :-0.1574   Mean   : 4.801   Mean   :   53.971  
                    3rd Qu.: 0.3977   3rd Qu.: 6.064   3rd Qu.:   30.949  
                    Max.   : 9.3663   Max.   :14.915   Max.   :13648.327  
                    NA's   :8950      NA's   :8950     NA's   :8950       
   iDUX4_pval       iDUX4_fdr         hl.ratio             area          
 Min.   :0.0000   Min.   :0.0000   Min.   :-19.6540   Min.   :     7952  
 1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.: -0.3842   1st Qu.:   290641  
 Median :0.0026   Median :0.0052   Median : -0.0529   Median :   545151  
 Mean   :0.1341   Mean   :0.1520   Mean   : -0.0306   Mean   :  1112496  
 3rd Qu.:0.1504   3rd Qu.:0.2002   3rd Qu.:  0.2283   3rd Qu.:  1014351  
 Max.   :0.9976   Max.   :0.9980   Max.   : 19.3091   Max.   :878105037  
 NA's   :8950     NA's   :8950     NA's   :15969      NA's   :15969      
 protein.length        rep             IQR               QCoD          
 Min.   :  19.0   Min.   :3.000   Min.   : 0.0000   Min.   :-7521.713  
 1st Qu.: 260.0   1st Qu.:3.333   1st Qu.: 0.3805   1st Qu.:   -2.960  
 Median : 433.0   Median :3.495   Median : 0.8014   Median :    0.000  
 Mean   : 603.4   Mean   :3.446   Mean   : 1.1079   Mean   :   -1.100  
 3rd Qu.: 728.0   3rd Qu.:3.560   3rd Qu.: 1.1911   3rd Qu.:    2.122  
 Max.   :8797.0   Max.   :4.000   Max.   :26.1119   Max.   : 3618.200  
 NA's   :15969    NA's   :15969   NA's   :15969     NA's   :15969      
     count             mean               sd             zscore        
 Min.   :   1.0   Min.   :-0.1496   Min.   :0.0152   Min.   :-77.7125  
 1st Qu.:   2.0   1st Qu.:-0.0024   1st Qu.:0.1453   1st Qu.: -1.1955  
 Median :   8.0   Median : 0.0024   Median :0.3121   Median : -0.1833  
 Mean   :  42.1   Mean   : 0.0239   Mean   :0.6794   Mean   :  0.2260  
 3rd Qu.:  31.0   3rd Qu.: 0.0315   3rd Qu.:1.3565   3rd Qu.:  0.7153  
 Max.   :2646.0   Max.   : 0.3272   Max.   :2.5400   Max.   :105.5495  
 NA's   :15969    NA's   :15969     NA's   :15969    NA's   :15969     
      pval       
 Min.   :0.0000  
 1st Qu.:0.0315  
 Median :0.3171  
 Mean   :0.3816  
 3rd Qu.:0.7035  
 Max.   :1.0000  
 NA's   :15969   

Step 2: Select only the columns we need:

  • geneid (gene identifier)
  • iDUX4_logFC (RNA fold change)
  • iDUX4_fdr (RNA pvalue)
  • hl.ratio (protein fold change)
  • pval (protein pvalue)

Use select() and list the columns you want to keep:

exp_tbl |>
  select(geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, pval)
# A tibble: 21,282 × 5
   geneid   iDUX4_logFC iDUX4_fdr hl.ratio       pval
   <chr>          <dbl>     <dbl>    <dbl>      <dbl>
 1 RFPL1           9.37 0             4.42  4.50e-145
 2 DUXA            8.73 0             8.92  0        
 3 RFPL2           8.58 0             3.03  5.78e- 25
 4 LEUTX           8.14 0             7.15  6.55e- 79
 5 RFPL3S          8.03 8.64e- 86    NA    NA        
 6 ZSCAN5C         7.84 5.76e-169     3.91  5.77e-  2
 7 USP29           7.71 4.48e- 36    NA    NA        
 8 FAM189A2        7.68 1.46e- 41    NA    NA        
 9 CCNA1           7.66 0             5.11  7.10e-169
10 ZNF280A         7.55 2.35e- 54    NA    NA        
# ℹ 21,272 more rows

Step 3: Rename columns for clarity (this makes your code more readable).

Use dplyr::rename() with the pattern new_name = old_name, ...:

exp_tbl |>
  select(geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, pval) |>
  rename(
    rna_FC = iDUX4_logFC,
    rna_pval = iDUX4_fdr,
    protein_FC = hl.ratio,
    protein_pval = pval
  )
# A tibble: 21,282 × 5
   geneid   rna_FC  rna_pval protein_FC protein_pval
   <chr>     <dbl>     <dbl>      <dbl>        <dbl>
 1 RFPL1      9.37 0               4.42    4.50e-145
 2 DUXA       8.73 0               8.92    0        
 3 RFPL2      8.58 0               3.03    5.78e- 25
 4 LEUTX      8.14 0               7.15    6.55e- 79
 5 RFPL3S     8.03 8.64e- 86      NA      NA        
 6 ZSCAN5C    7.84 5.76e-169       3.91    5.77e-  2
 7 USP29      7.71 4.48e- 36      NA      NA        
 8 FAM189A2   7.68 1.46e- 41      NA      NA        
 9 CCNA1      7.66 0               5.11    7.10e-169
10 ZNF280A    7.55 2.35e- 54      NA      NA        
# ℹ 21,272 more rows

Step 4: Clean the data by removing rows with missing values. Use drop_na() to remove rows with any missing values, and distinct() to remove duplicate rows:

exp_tbl |>
  select(geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, pval) |>
  rename(
    rna_FC = iDUX4_logFC,
    rna_pval = iDUX4_fdr,
    protein_FC = hl.ratio,
    protein_pval = pval
  ) |>
  drop_na() |> # Remove rows with any missing values
  distinct() # Remove duplicate rows
# A tibble: 4,931 × 5
   geneid   rna_FC  rna_pval protein_FC protein_pval
   <chr>     <dbl>     <dbl>      <dbl>        <dbl>
 1 RFPL1      9.37 0               4.42    4.50e-145
 2 DUXA       8.73 0               8.92    0        
 3 RFPL2      8.58 0               3.03    5.78e- 25
 4 LEUTX      8.14 0               7.15    6.55e- 79
 5 ZSCAN5C    7.84 5.76e-169       3.91    5.77e-  2
 6 CCNA1      7.66 0               5.11    7.10e-169
 7 PRAMEF1    7.54 0               4.82    3.49e- 76
 8 TPRX1      7.29 2.41e-132       7.35    2.85e-  4
 9 PRAMEF12   7.25 0               7.55    0        
10 RFPL4B     7.16 0               7.46    0        
# ℹ 4,921 more rows

Step 5: Finally, arrange the data and save it. Use arrange() to sort by RNA fold change (high to low), then protein fold change (low to high):

exp_tbl_subset <- exp_tbl |>
  select(geneid, iDUX4_logFC, iDUX4_fdr, hl.ratio, pval) |>
  rename(
    rna_FC = iDUX4_logFC,
    rna_pval = iDUX4_fdr,
    protein_FC = hl.ratio,
    protein_pval = pval
  ) |>
  drop_na() |>
  distinct() |>
  arrange(desc(rna_FC), protein_FC) # Sort by RNA fold change (high to low), then protein fold change (low to high)

exp_tbl_subset
# A tibble: 4,931 × 5
   geneid   rna_FC  rna_pval protein_FC protein_pval
   <chr>     <dbl>     <dbl>      <dbl>        <dbl>
 1 RFPL1      9.37 0               4.42    4.50e-145
 2 DUXA       8.73 0               8.92    0        
 3 RFPL2      8.58 0               3.03    5.78e- 25
 4 LEUTX      8.14 0               7.15    6.55e- 79
 5 ZSCAN5C    7.84 5.76e-169       3.91    5.77e-  2
 6 CCNA1      7.66 0               5.11    7.10e-169
 7 PRAMEF1    7.54 0               4.82    3.49e- 76
 8 TPRX1      7.29 2.41e-132       7.35    2.85e-  4
 9 PRAMEF12   7.25 0               7.55    0        
10 RFPL4B     7.16 0               7.46    0        
# ℹ 4,921 more rows

Question 3 5 points

Let’s practice good data analysis habits by checking for potential issues. Quality control is essential in real data analysis.

Check for duplicates and missing values:

  1. Use count() to check for duplicate genes
  2. Use summarize() with across() to count missing values in all columns
  3. Use summary statistics to understand data distributions
# Check for duplicate genes (there shouldn't be any after distinct())
dup_tbl <-
  exp_tbl_subset |>
  count(geneid) |>
  filter(n > 1) # Any genes appearing more than once?

There are 14 duplicate genes in the dataset.

# Summary of missing values by column
exp_tbl_subset |>
  summarize(
    across(everything(), ~ sum(is.na(.)))
  )
# A tibble: 1 × 5
  geneid rna_FC rna_pval protein_FC protein_pval
   <int>  <int>    <int>      <int>        <int>
1      0      0        0          0            0
# Look at the distribution of our main variables
exp_tbl_subset |>
  summarize(
    across(
      c(rna_FC, protein_FC),
      list(
        mean = ~ mean(., na.rm = TRUE),
        median = ~ median(., na.rm = TRUE),
        sd = ~ sd(., na.rm = TRUE)
      )
    ),
    .groups = "drop"
  )
# A tibble: 1 × 6
  rna_FC_mean rna_FC_median rna_FC_sd protein_FC_mean protein_FC_median
        <dbl>         <dbl>     <dbl>           <dbl>             <dbl>
1      -0.176        -0.309      1.12         -0.0376           -0.0543
# ℹ 1 more variable: protein_FC_sd <dbl>

For your reference, here are three different ways to write the same function inside across(). They all compute the mean while ignoring NA values.

exp_tbl_subset |>
  summarize(
    across(
      c(rna_FC, protein_FC),
      list(
        # Three different ways to write the same function
        mean1 = ~ mean(., na.rm = TRUE),
        mean2 = function(x) mean(, na.rm = TRUE),
        mean3 = \(x) mean(x, na.rm = TRUE),
      )
    )
  )

Question 4 5 points

How well do the overall rna_FC and protein_FC values correlate in this experiment? We’ll explore this with visualization and statistics.

Step 1: Create a scatter plot of rna_FC vs protein_FC using ggplot(). Use:

  • aes() to map x and y variables
  • geom_point() to create the scatter plot
  • labs() to add informative axis labels and title
ggplot(
  exp_tbl_subset,
  aes(
    x = rna_FC,
    y = protein_FC
  )
) +
  geom_point() +
  labs(
    x = "RNA Fold Change (log2)",
    y = "Protein Fold Change (log2)",
    title = "RNA vs Protein Expression Changes"
  )

Step 2: Add reference lines to help interpret the correlation. Use:

  • geom_abline(slope = 1, intercept = 0) for perfect correlation line
  • geom_smooth(method = "lm", se = FALSE) for the computed trend line
  • adjust the geom_point() aesthetic to alpha = 0.6, making points slightly transparent for better visualization
ggplot(
  exp_tbl_subset,
  aes(
    x = rna_FC,
    y = protein_FC
  )
) +
  geom_point(alpha = 0.6) + # Make points a bit transparent
  geom_abline(slope = 1, intercept = 0, color = "red", linewidth = 1) + # Perfect correlation line
  geom_smooth(method = "lm", se = FALSE, color = "blue", linewidth = 1) + # Actual relationship
  labs(
    x = "RNA Fold Change (log2)",
    y = "Protein Fold Change (log2)",
    title = "RNA vs Protein Expression Changes"
  )
`geom_smooth()` using formula = 'y ~ x'

Step 3: Calculate the correlation coefficient using cor(). Use Spearman correlation since it’s robust to outliers. Use ?cor to see the function documentation. You will need to specify two vectors for the calculation, and it’s easiest to provide them using the $ operator to extract columns from the data frame.

rna_prot_cor <- cor(
  exp_tbl_subset$rna_FC,
  exp_tbl_subset$protein_FC,
  method = "spearman"
)

rna_prot_cor
[1] 0.3458433

Answer

The green line shows perfect correlation (y = x), and the blue line shows the actual relationship in our data. The Spearman correlation is 0.346, indicating a strong positive correlation between RNA and protein changes, but not perfect correlation.

Submit

Be sure to click the “Render” button to render the HTML output.

Then paste the URL of this Posit Cloud project into the problem set on Canvas.