Exercises 8: Gene Expression Analysis with Tidyverse

Advanced Modeling and Functional Analysis

Author

Your Name

Published

September 6, 2025

1 Overview

In this problem set, you’ll work with the Brauer gene expression dataset to practice core tidyverse skills: data tidying, transformation, joins, pivoting, string manipulation, and statistical modeling with broom. The dataset contains expression measurements for yeast genes grown under different nutrient limitations and at different growth rates.

1.1 Predictions

Before we start tidying and analyzing the data, take a moment to predict what you might find.

  1. Question: What patterns do you expect to see in gene expression across different nutrients and growth rates?
  2. Hypothesis: Genes involved in nutrient uptake and metabolism will show higher expression under their respective limiting conditions.

2 Setup and Data Loading

2.1 Load Required Libraries

Code
# Load required libraries

2.2 Load the Data

Task 1: Load the raw Brauer gene expression data and examine its structure. What makes this data “messy” or untidy?

Breadcrumbs: Use read_tsv() to load the data from the URL. Examine the column names and the first few rows. Think about tidy data principles: which of them does the current format violate?
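
If read_tsv() is new to you, here is a minimal sketch on a small inline table rather than the real file. The toy column names and values are made up for illustration, so the real file may look different. The sketch assumes readr 2.0 or later, where I() marks a string as literal data.

```r
library(readr)

# Toy tab-separated text standing in for the real file (hypothetical values)
toy_text <- "GID\tNAME\tG0.05\tN0.05\nGENE1\tABC1 || some process || some function || YXX001C || 101\t0.12\t-0.30\n"

# I() tells readr to treat the string as data rather than as a file path
toy_raw <- read_tsv(I(toy_text), show_col_types = FALSE)

# Inspect the column names and types
names(toy_raw)
str(toy_raw)
```

For the real data, you can pass the URL string directly to read_tsv(); readr reads remote and gzipped files without extra steps.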

Code
# Load the Brauer gene expression data
# URL: "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_raw.tsv.gz"

Code
# Examine the structure of the data

3 Part 1: Data Tidying (tidyr)

We want to create a table that looks like this:

# A tibble: 199,296 × 4
   systematic_name nutrient  rate exp_level
   <chr>           <fct>    <dbl>     <dbl>
 1 YPR204W         Glucose   0.05      1.17
 2 YPR204W         Glucose   0.1       1
 3 YPR204W         Glucose   0.15      0.86
 4 YPR204W         Glucose   0.2       0.77
 5 YPR204W         Glucose   0.25      0.53
 6 YPR204W         Glucose   0.3       0.3
 7 YPR204W         Ammonia   0.05      2.79
 8 YPR204W         Ammonia   0.1       2
 9 YPR204W         Ammonia   0.15      0.6
10 YPR204W         Ammonia   0.2       0.16
# ℹ 199,286 more rows
# ℹ Use `print(n = ...)` to see more rows

3.1 Separate the NAME Column

Task 2: The NAME column contains multiple pieces of information separated by “||”. Split this into meaningful columns.

Breadcrumbs: Use separate_wider_delim() to split the NAME column. You’ll want columns for gene name, biological process, molecular function, systematic name, and number. Don’t forget to clean up whitespace and handle empty strings.
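
As a hint, here is a rough sketch of the splitting step on two made-up NAME strings. The new column names (name, BP, MF, systematic_name, number) are one possible choice, not required names, and turning empty strings into NA is an assumption about how you might want to handle them.

```r
library(dplyr)
library(tidyr)
library(stringr)

# Two hypothetical NAME values; the second has an empty gene name
toy <- tibble(
  NAME = c(
    "ABC1 || some process || some function || YXX001C || 101",
    " || another process || another function || YXX002W || 102"
  )
)

toy_sep <- toy |>
  # Split on the literal "||" delimiter into five named columns
  separate_wider_delim(
    NAME,
    delim = "||",
    names = c("name", "BP", "MF", "systematic_name", "number")
  ) |>
  # Trim surrounding whitespace, then turn empty strings into NA
  mutate(across(everything(), function(x) na_if(str_trim(x), "")))

toy_sep
```

Because the delimiter is padded with spaces, trimming is needed before comparing or joining on any of the new columns.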

Code
# Separate the NAME column into meaningful components

3.2 Create a Tidy Dataset

Task 3: Transform the wide-format expression data into a long format suitable for analysis.

Breadcrumbs: First select the relevant columns (systematic_name and the expression columns). Then use pivot_longer() to convert expression columns to rows. The column names contain both nutrient type and growth rate information.
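
The reshaping step can be sketched on a tiny wide table. The gene IDs and values below are hypothetical, and the names "sample" and "exp_level" for the new columns are just one reasonable choice.

```r
library(dplyr)
library(tidyr)

# Hypothetical wide table: one row per gene, one column per condition
toy_wide <- tibble(
  systematic_name = c("YXX001C", "YXX002W"),
  G0.05 = c(-0.24, 0.28),
  G0.1  = c(-0.13, 0.13),
  N0.05 = c(0.20, -0.40)
)

toy_long <- toy_wide |>
  # Every column except the gene ID becomes a (sample, exp_level) pair
  pivot_longer(
    cols = -systematic_name,
    names_to = "sample",
    values_to = "exp_level"
  )

toy_long
```

Two genes across three conditions give six rows, one per gene-condition measurement.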

Code
# Transform to long format

3.3 Parse Nutrient and Rate Information

Task 4: Extract nutrient type and growth rate from the column names in your long dataset.

Breadcrumbs: The column names follow a pattern like “G0.05” where the first character is the nutrient abbreviation and the rest is the growth rate. Use separate_wider_position() to split these. Create a lookup table for nutrient abbreviations.
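
Here is a small sketch of splitting by position, using made-up labels that all have the same width. The lookup table is an assumption: only Glucose and Ammonia appear in the target table above, so check the abbreviations against your own data and extend the table as needed.

```r
library(dplyr)
library(tidyr)

# Hypothetical condition labels, all five characters wide
toy_long <- tibble(
  sample = c("G0.05", "N0.15", "G0.25"),
  exp_level = c(1.17, 0.60, 0.53)
)

toy_parsed <- toy_long |>
  # First character is the nutrient code, the next four are the rate
  separate_wider_position(sample, widths = c(nutrient = 1, rate = 4)) |>
  # The rate arrives as text, so convert it to a number
  mutate(rate = as.numeric(rate))

# Assumed mapping from abbreviation to full nutrient name
nutrient_lookup <- tribble(
  ~nutrient, ~nutrient_name,
  "G",       "Glucose",
  "N",       "Ammonia"
)

toy_parsed
nutrient_lookup
```

Labels with a shorter rate part (for example "G0.1") do not fill the full width; the too_few argument of separate_wider_position() is the place to look for handling that case.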

Code
# Extract nutrient and growth rate information

Code
# Create nutrient lookup table

4 Part 2: Data Transformation (dplyr)

4.1 Filter and Clean

Task 5: Remove any rows with missing systematic names and add meaningful nutrient names.

Breadcrumbs: Use filter() to remove empty systematic names. Create a nutrient lookup table and use left_join() to add full nutrient names. Convert appropriate columns to factors.
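
The filter-then-join pattern might look like this on toy data. The gene IDs, values, and the two-row lookup table are hypothetical, and replacing the abbreviation with the full name is one possible design, not the only one.

```r
library(dplyr)

# Hypothetical long-format data with one missing and one empty gene ID
toy <- tibble(
  systematic_name = c("YXX001C", NA, "", "YXX002W"),
  nutrient = c("G", "N", "G", "N"),
  rate = c(0.05, 0.05, 0.10, 0.10),
  exp_level = c(1.17, 2.79, 1.00, 2.00)
)

# Assumed abbreviation-to-name mapping
nutrient_lookup <- tribble(
  ~nutrient, ~nutrient_name,
  "G",       "Glucose",
  "N",       "Ammonia"
)

toy_clean <- toy |>
  # Drop rows where the gene ID is NA or an empty string
  filter(!is.na(systematic_name), systematic_name != "") |>
  # Attach the full nutrient name for each abbreviation
  left_join(nutrient_lookup, by = "nutrient") |>
  # Replace the abbreviation with the full name, stored as a factor
  mutate(nutrient = factor(nutrient_name)) |>
  select(-nutrient_name)

toy_clean
```

A left join keeps every expression row even if an abbreviation is missing from the lookup table, which makes unmatched codes show up as NA instead of silently disappearing.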

Code
# Filter and clean the data

4.2 Explore Expression Patterns

Task 6: Calculate summary statistics for gene expression by nutrient type.

Breadcrumbs: Use group_by() and summarize() to calculate mean, median, and standard deviation of expression values for each nutrient. Which nutrients show the highest variability in expression?
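
A grouped summary could be sketched as follows. The expression values are invented so the results are easy to check by hand; the summary column names are arbitrary.

```r
library(dplyr)

# Hypothetical tidy data: three measurements per nutrient
toy <- tibble(
  nutrient = rep(c("Glucose", "Ammonia"), each = 3),
  exp_level = c(1, 2, 3, 0, 4, 8)
)

toy_summary <- toy |>
  group_by(nutrient) |>
  summarize(
    mean_exp   = mean(exp_level, na.rm = TRUE),
    median_exp = median(exp_level, na.rm = TRUE),
    sd_exp     = sd(exp_level, na.rm = TRUE),
    .groups = "drop"
  ) |>
  # Most variable nutrient first
  arrange(desc(sd_exp))

toy_summary
```

Sorting by the standard deviation puts the most variable nutrient at the top, which answers the variability question at a glance.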

Code
# Calculate summary statistics by nutrient

4.3 Identify High and Low Expression

Task 7: Find genes with extreme expression values under different conditions.

Breadcrumbs: For each nutrient-rate combination, identify the five genes with the highest expression and the five with the lowest. Use slice_max() and slice_min(), or ranking functions. What patterns do you notice?
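
One way to sketch the grouped slicing, using n = 2 on a toy table so the output stays small. The gene labels and values are hypothetical, and the "direction" column is just an illustrative way to tell the two sets apart.

```r
library(dplyr)

# Hypothetical data: four genes measured in two nutrient-rate combinations
toy <- tibble(
  systematic_name = rep(c("A", "B", "C", "D"), times = 2),
  nutrient = rep(c("Glucose", "Ammonia"), each = 4),
  rate = 0.05,
  exp_level = c(1.2, -0.5, 0.3, 2.1, -1.0, 0.4, 0.9, -0.2)
)

grouped <- toy |> group_by(nutrient, rate)

# Highest and lowest expressing genes within each group
top2    <- grouped |> slice_max(exp_level, n = 2, with_ties = FALSE) |> ungroup()
bottom2 <- grouped |> slice_min(exp_level, n = 2, with_ties = FALSE) |> ungroup()

# Stack both sets, labelling which end of the ranking each row came from
extremes <- bind_rows(highest = top2, lowest = bottom2, .id = "direction")

extremes
```

Setting with_ties = FALSE guarantees exactly n rows per group; leave it at the default if you would rather keep tied genes.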

Code
# Find genes with extreme expression values