Advanced Modeling and Functional Analysis
In this problem set, you’ll work with the Brauer gene expression dataset to practice comprehensive tidyverse skills including data tidying, transformation, joins, pivoting, string manipulation, and statistical modeling using broom. The dataset contains gene expression measurements for yeast genes under different nutrient limitations and growth rates.
Before we start tidying and analyzing the data, take a moment to predict what you might find.
# Load required libraries
Task 1: Load the raw Brauer gene expression data and examine its structure. What makes this data “messy” or untidy?
Breadcrumbs: Use read_tsv() to load the data from the URL. Examine column names and the first few rows. Think about tidy data principles - what issues do you see with the current format?
# Load the Brauer gene expression data
# URL: "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_raw.tsv.gz"
# Examine the structure of the data
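As a rough sketch of the inspection step, the block below reads a tiny tab-separated string instead of the real file. The toy rows and values are invented for illustration; only the NAME column and the "G0.05"-style sample columns come from the task descriptions. The same calls should apply to the data loaded from the URL above.

```r
library(tidyverse)

# Toy stand-in for the raw file (made-up values): a packed NAME column
# plus one column per nutrient/rate sample.
toy_tsv <- paste(
  "NAME\tG0.05\tG0.1\tN0.05",
  "GENE1 || process A || function X || YPR204W || 1\t1.17\t1.00\t2.79",
  "GENE2 || process B || function Y || YAL001C || 2\t0.20\t0.10\t-0.30",
  sep = "\n"
)

# I() tells readr to treat the string as literal data rather than a path.
# For the real data, pass the URL string to read_tsv() instead.
toy_raw <- read_tsv(I(toy_tsv), show_col_types = FALSE)

names(toy_raw)    # column names: variables are hiding in headers and in NAME
glimpse(toy_raw)  # column types and the first few values
```

Two things to look for in the output: column headers that encode values (nutrient and rate) and a single column that holds several variables at once.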
We want to create a table that looks like this:
# A tibble: 199,296 × 4
   systematic_name nutrient  rate exp_level
   <chr>           <fct>    <dbl>     <dbl>
 1 YPR204W         Glucose   0.05      1.17
 2 YPR204W         Glucose   0.1       1
 3 YPR204W         Glucose   0.15      0.86
 4 YPR204W         Glucose   0.2       0.77
 5 YPR204W         Glucose   0.25      0.53
 6 YPR204W         Glucose   0.3       0.3
 7 YPR204W         Ammonia   0.05      2.79
 8 YPR204W         Ammonia   0.1       2
 9 YPR204W         Ammonia   0.15      0.6
10 YPR204W         Ammonia   0.2       0.16
# ℹ 199,286 more rows
# ℹ Use `print(n = ...)` to see more rows
Task 2: The NAME column contains multiple pieces of information separated by “||”. Split this into meaningful columns.
Breadcrumbs: Use separate_wider_delim() to split the NAME column. You’ll want columns for gene name, biological process, molecular function, systematic name, and number. Don’t forget to clean up whitespace and handle empty strings.
# Separate the NAME column into meaningful components
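One possible shape for this step, shown on two invented NAME strings rather than the real data. The output column names (name, BP, MF, systematic_name, number) are illustrative choices, not names the problem set requires.

```r
library(tidyverse)

# Made-up NAME values; the second row has an empty gene name on purpose.
toy <- tibble(
  NAME = c(
    "GENE1   || process A || function X || YPR204W || 1",
    "        || process B || function Y || YAL001C || 2"
  )
)

toy_split <- toy |>
  # delim is interpreted as a fixed string by default, so "||" needs no escaping
  separate_wider_delim(
    NAME,
    delim = "||",
    names = c("name", "BP", "MF", "systematic_name", "number")
  ) |>
  # strip the padding left around each piece
  mutate(across(everything(), str_trim)) |>
  # turn empty strings into explicit missing values
  mutate(across(everything(), \(x) na_if(x, "")))

toy_split
```

Trimming before na_if() matters here: a piece that is only whitespace becomes "" after str_trim(), and only then can it be converted to NA.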
Task 3: Transform the wide-format expression data into a long format suitable for analysis.
Breadcrumbs: First select the relevant columns (systematic_name and the expression columns). Then use pivot_longer() to convert expression columns to rows. The column names contain both nutrient type and growth rate information.
# Transform to long format
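A minimal sketch of the reshaping on a toy wide table. The values are invented, and the column names sample and exp_level are assumptions chosen to match the target table shown earlier.

```r
library(tidyverse)

# Toy wide table: one row per gene, one column per nutrient/rate sample.
toy_wide <- tibble(
  systematic_name = c("YPR204W", "YAL001C"),
  G0.05 = c(1.17, 0.20),
  G0.1  = c(1.00, 0.10),
  N0.05 = c(2.79, -0.30)
)

toy_long <- toy_wide |>
  pivot_longer(
    cols = -systematic_name,   # every column except the gene identifier
    names_to = "sample",       # old column names, e.g. "G0.05"
    values_to = "exp_level"    # the expression measurements
  )

toy_long
```

Each gene now contributes one row per sample, so 2 genes and 3 sample columns give 6 rows.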
Task 4: Extract nutrient type and growth rate from the column names in your long dataset.
Breadcrumbs: The column names follow a pattern like “G0.05” where the first character is the nutrient abbreviation and the rest is the growth rate. Use separate_wider_position() to split these. Create a lookup table for nutrient abbreviations.
# Extract nutrient and growth rate information
# Create nutrient lookup table
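The sketch below splits toy sample labels by position. It assumes the rate part is at most four characters wide and that some labels are shorter (for example "G0.1"), which is why too_few = "align_start" is set. The lookup table covers only two abbreviations, and the pairing N = Ammonia is an assumption to verify against the real column names.

```r
library(tidyverse)

toy_long <- tibble(
  systematic_name = "YPR204W",
  sample = c("G0.05", "G0.3", "N0.1"),
  exp_level = c(1.17, 0.30, 2.00)
)

toy_split <- toy_long |>
  separate_wider_position(
    sample,
    widths = c(nutrient = 1, rate = 4),
    too_few = "align_start"   # tolerate labels shorter than 5 characters
  ) |>
  mutate(rate = as.numeric(rate))

# Partial lookup table (assumed mapping); extend it to every abbreviation
# that appears in the real data.
nutrient_lookup <- tibble(
  nutrient = c("G", "N"),
  nutrient_name = c("Glucose", "Ammonia")
)

toy_split
```

Without too_few, the default is to raise an error for any label narrower than the declared total width.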
Task 5: Remove any rows with missing systematic names and add meaningful nutrient names.
Breadcrumbs: Use filter() to remove empty systematic names. Create a nutrient lookup table and use left_join() to add full nutrient names. Convert appropriate columns to factors.
# Filter and clean the data
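A small illustration of the filter-then-join pattern on invented rows. The lookup table again covers only two abbreviations (an assumed mapping), and replacing the abbreviation with the full name is one design choice among several.

```r
library(tidyverse)

toy <- tibble(
  systematic_name = c("YPR204W", "", NA, "YPR204W"),
  nutrient = c("G", "G", "N", "N"),
  rate = c(0.05, 0.05, 0.10, 0.10),
  exp_level = c(1.17, 0.40, 0.90, 2.00)
)

# Assumed, partial mapping from abbreviation to full nutrient name.
nutrient_lookup <- tibble(
  nutrient = c("G", "N"),
  nutrient_name = c("Glucose", "Ammonia")
)

toy_clean <- toy |>
  # drop rows whose systematic name is missing or an empty string
  filter(!is.na(systematic_name), systematic_name != "") |>
  # bring in the full nutrient name for each abbreviation
  left_join(nutrient_lookup, by = "nutrient") |>
  # overwrite the abbreviation with the full name, stored as a factor
  mutate(nutrient = factor(nutrient_name)) |>
  select(-nutrient_name)

toy_clean
```

Because left_join() keeps every row of the left table, an abbreviation missing from the lookup would show up as NA in nutrient, which makes gaps in the lookup easy to spot.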
Task 6: Calculate summary statistics for gene expression by nutrient type.
Breadcrumbs: Use group_by() and summarize() to calculate mean, median, and standard deviation of expression values for each nutrient. Which nutrients show the highest variability in expression?
# Calculate summary statistics by nutrient
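For illustration, here is the grouped summary on a toy table with invented values. The summary column names are arbitrary, and sorting by standard deviation is just one way to answer the variability question.

```r
library(tidyverse)

toy <- tibble(
  nutrient = c("Glucose", "Glucose", "Glucose", "Ammonia", "Ammonia", "Ammonia"),
  exp_level = c(1, 2, 3, -2, 0, 2)
)

toy_summary <- toy |>
  group_by(nutrient) |>
  summarize(
    mean_exp   = mean(exp_level, na.rm = TRUE),
    median_exp = median(exp_level, na.rm = TRUE),
    sd_exp     = sd(exp_level, na.rm = TRUE),
    n          = n()
  ) |>
  # most variable nutrient first
  arrange(desc(sd_exp))

toy_summary
```

Setting na.rm = TRUE keeps a single missing measurement from turning a whole group's summary into NA.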
Task 7: Find genes with extreme expression values under different conditions.
Breadcrumbs: For each nutrient-rate combination, identify the top 5 highest and lowest expressing genes. Use slice_max() and slice_min() or ranking functions. What patterns do you notice?
# Find genes with extreme expression values
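A sketch of the grouped slicing on a toy table with invented genes and values. It uses n = 2 only because the toy groups are small; the task asks for 5. The direction label is an added convenience, not something the problem set specifies.

```r
library(tidyverse)

toy <- tibble(
  systematic_name = rep(c("geneA", "geneB", "geneC", "geneD"), times = 2),
  nutrient = rep(c("Glucose", "Ammonia"), each = 4),
  rate = 0.05,
  exp_level = c(1.2, -0.5, 0.3, 2.1, -1.4, 0.8, 0.1, -0.2)
)

by_condition <- toy |> group_by(nutrient, rate)

# highest-expressing genes within each nutrient-rate combination
top <- by_condition |>
  slice_max(exp_level, n = 2) |>
  mutate(direction = "highest")

# lowest-expressing genes within each nutrient-rate combination
bottom <- by_condition |>
  slice_min(exp_level, n = 2) |>
  mutate(direction = "lowest")

extremes <- bind_rows(top, bottom) |>
  ungroup() |>
  arrange(nutrient, rate, desc(exp_level))

extremes
```

Both slice functions keep ties by default (with_ties = TRUE), so a group can return more than n rows when values are equal at the cutoff.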