Exercises 8: Gene Expression Analysis with Tidyverse

Advanced Modeling and Functional Analysis

Author

Your Name

Published

October 20, 2025

1 Overview

In this problem set, you’ll work with the Brauer gene expression dataset to practice comprehensive tidyverse skills including data tidying, transformation, joins, pivoting, string manipulation, and statistical modeling using broom. The dataset contains gene expression measurements for yeast genes under different nutrient limitations and growth rates.

1.1 Predictions

Before we start tidying and analyzing the data, take a moment to predict what you might find.

Question: What patterns do you expect to see in gene expression across different nutrients and growth rates?
Hypothesis: Genes involved in nutrient uptake and metabolism will show higher expression under their respective limiting conditions.
Hypohtesis: The different nutrient conditions will cause distinct gene expression profiles.

2 Setup and Data Loading

2.1 Load Required Libraries

Code

# Load required libraries

2.2 Load the Data

Task 1: Load the raw Brauer gene expression data and examine its structure. What makes this data “messy” or untidy?

Breadcrumbs: Use read_tsv() to load the data from the URL. Examine column names and the first few rows. Think about tidy data principles - what issues do you see with the current format?

Code

#|label: data-loading-02

# Load the Brauer gene expression data
url <- "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_raw.tsv.gz"

Code

# Examine the structure of the data

3 Part 1: Data Tidying (tidyr)

We want to create a tibble that looks like this:

# A tibble: 199,296 × 4
   systematic_name nutrient  rate exp_level
   <chr>           <fct>    <dbl>     <dbl>
 1 YPR204W         Glucose   0.05      1.17
 2 YPR204W         Glucose   0.1       1
 3 YPR204W         Glucose   0.15      0.86
 4 YPR204W         Glucose   0.2       0.77
 5 YPR204W         Glucose   0.25      0.53
 6 YPR204W         Glucose   0.3       0.3
 7 YPR204W         Ammonia   0.05      2.79
 8 YPR204W         Ammonia   0.1       2
 9 YPR204W         Ammonia   0.15      0.6
10 YPR204W         Ammonia   0.2       0.16
# ℹ 199,286 more rows
# ℹ Use `print(n = ...)` to see more rows

Note the classes of each column - systematic_name is character, nutrient is a factor, rate is numeric, and exp_level is numeric.

3.1 Create a Tidy Dataset

Task 3: Transform the wide-format expression data into a long format suitable for analysis.

Breadcrumbs:

First select the relevant columns (systematic_name and the expression columns (G0.05, etc))
Then use pivot_longer() to convert expression columns to rows
The column names contain both nutrient type (G) and growth rate (0.05) information - use separate_wider_position() to split the first character (nutrient abbreviation) from the numeric rate. The key here is to specify widths = c(nutrient_abbr = 1, rate = 4) to split after the first character, and put the rest into rate. Not all the rates have 4 characters, so use too_few = "align_start" to handle that.

Create a nutrient lookup table to convert abbreviations to full names. The lookup table should look like this:

# A tibble: 6 × 2
  nutrient_abbr nutrient
  <chr>         <chr>
1 G             Glucose
2 N             Ammonia
3 P             Phosphate
4 S             Sulfate
5 L             Leucine
6 U             Uracil

Use that table with a join function to add full nutrient names to the tidied data.
Convert the rate column to numeric and select the final columns in the desired order
Remove any NA values from systematic_name and exp_level using filter()

4 Part 2: Expression patterns

This is a “cheat” chunk to load the tidy data directly if you want to skip ahead. Otherwise, use the code above to create the tidy data from the raw data.

Code

brauer_tidy_tbl <- read_tsv(
  "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_tidy.tsv.gz"
)

yeast_gene_info_tbl <- read_tsv(
  "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/yeast_go_terms.tsv.gz"
)

yeast_protein_props_tbl <- read_tsv(
  "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/yeast_prot_prop.tsv.gz"
)

4.1 Explore Expression Patterns

Task 6: Calculate summary statistics for gene expression by nutrient type.

Breadcrumbs: - Use group_by() and summarize() to calculate mean, median, and standard deviation of expression values for each nutrient. Which nutrients show the highest variability in expression?*

Code

# Calculate summary statistics by nutrient

Code

# Find genes with extreme expression values

Inspect the results and note any patterns you observe in high and low expression genes across different nutrient-rate combinations.

Next, make a boxplot to visualize the distribution of expression levels for each nutrient condition. What insights can you draw from the plot?

Code

#|

4.2 Identify High and Low Expression

Task 7: Find genes with extreme expression values under different conditions.

Breadcrumbs: For each nutrient-rate combination, identify the top 5 highest and lowest expressing genes. Use slice_max() and slice_min() or ranking functions. What patterns do you notice?

Code

# Find genes with extreme expression values

5 Part 3: Statistical Modeling with broom

5.1 Linear Models for Individual Genes

Task 8: Fit linear models to examine how each gene’s expression responds to growth rate within each nutrient condition.

Breadcrumbs:

Use reframe() with broom::tidy(lm(exp_level ~ rate)) for each gene-nutrient combination
Use .by = c(nutrient, systematic_name) to group the modeling
This creates a tidy data frame of model coefficients

Code

# Fit linear models for each gene-nutrient combination
brauer_models_tbl <-
  brauer_tidy_tbl |>
  reframe(
    broom::tidy(lm(exp_level ~ rate)),
    .by = c(nutrient, systematic_name)
  )

The above should give you a tibble that looks like this:

# A tibble: 66,430 × 7
   nutrient  systematic_name term        estimate std.error statistic p.value
   <chr>     <chr>           <chr>          <dbl>     <dbl>     <dbl>   <dbl>
 1 Glucose   YNL049C         (Intercept)   -0.263    0.0441    -5.98  0.00394
 2 Glucose   YNL049C         rate           0.714    0.226      3.16  0.0343
 3 Ammonia   YNL049C         (Intercept)    0.153    0.247      0.619 0.569
 4 Ammonia   YNL049C         rate          -1.09     1.27      -0.862 0.437
 5 Phosphate YNL049C         (Intercept)   -0.465    0.161     -2.89  0.0444
 6 Phosphate YNL049C         rate           2.07     0.825      2.51  0.0657
 7 Sulfate   YNL049C         (Intercept)   -0.419    0.157     -2.67  0.0558
 8 Sulfate   YNL049C         rate           2.23     0.807      2.77  0.0503
 9 Leucine   YNL049C         (Intercept)    0.193    0.0313     6.16  0.00352
10 Leucine   YNL049C         rate          -0.177    0.161     -1.10  0.332
# ℹ 66,420 more rows
# ℹ Use `print(n = ...)` to see more rows

5.2 Analyze Slope Coefficients

Task 9: Examine the slope terms to identify genes that significantly respond to growth rate changes.

Breadcrumbs: Filter for slope terms (not intercepts). Use q-value correction for multiple testing. Create histograms of p-values by nutrient. Which genes show the strongest positive or negative relationships with growth rate?

Code

# Analyze slope coefficients

5.3 Analyze Intercept Terms

Task 10: Use intercept terms to identify genes with unusual baseline expression under specific nutrient limitations.

Breadcrumbs: Filter for intercept terms. Center intercepts around each gene’s mean across nutrients using group_by() and mutate(). Use top_n() to find genes with extreme centered intercepts.

Code

# Analyze intercept terms

6 Your assignment

Use this above data to try to ask and answer your own biological question.

6.1 Hypothesis

State one or more hypotheses you want to test using this data.

6.2 Explore / Analyze

Generate some tables or plots to explore your hypothesis.

6.3 Interpreation

Interpret your results. What do they mean in the context of your hypothesis? Were you able to support or refute your hypothesis?