Code
# Load required libraries
Advanced Modeling and Functional Analysis
In this problem set, you’ll work with the Brauer gene expression dataset to practice comprehensive tidyverse skills including data tidying, transformation, joins, pivoting, string manipulation, and statistical modeling using broom. The dataset contains gene expression measurements for yeast genes under different nutrient limitations and growth rates.
Before we start tidying and analyzing the data, take a moment to predict what you might find.
# Load required libraries
Task 1: Load the raw Brauer gene expression data and examine its structure. What makes this data “messy” or untidy?
Breadcrumbs: Use read_tsv()
to load the data from the URL. Examine column names and the first few rows. Think about tidy data principles - what issues do you see with the current format?
#|label: data-loading-02
# Load the Brauer gene expression data
url <- "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_raw.tsv.gz"
# Examine the structure of the data
We want to create a tibble that looks like this:
# A tibble: 199,296 × 4
systematic_name nutrient rate exp_level<chr> <fct> <dbl> <dbl>
1 YPR204W Glucose 0.05 1.17
2 YPR204W Glucose 0.1 1
3 YPR204W Glucose 0.15 0.86
4 YPR204W Glucose 0.2 0.77
5 YPR204W Glucose 0.25 0.53
6 YPR204W Glucose 0.3 0.3
7 YPR204W Ammonia 0.05 2.79
8 YPR204W Ammonia 0.1 2
9 YPR204W Ammonia 0.15 0.6
10 YPR204W Ammonia 0.2 0.16
# ℹ 199,286 more rows
# ℹ Use `print(n = ...)` to see more rows
Note the classes of each column - systematic_name is character, nutrient is a factor, rate is numeric, and exp_level is numeric.
Task 3: Transform the wide-format expression data into a long format suitable for analysis.
Breadcrumbs:
First select the relevant columns (systematic_name
and the expression columns (G0.05
, etc))
Then use pivot_longer()
to convert expression columns to rows
The column names contain both nutrient type (G
) and growth rate (0.05
) information - use separate_wider_position()
to split the first character (nutrient abbreviation) from the numeric rate. The key here is to specify widths = c(nutrient_abbr = 1, rate = 4)
to split after the first character, and put the rest into rate
. Not all the rates have 4 characters, so use too_few = "align_start"
to handle that.
Create a nutrient lookup table to convert abbreviations to full names. The lookup table should look like this:
# A tibble: 6 × 2
nutrient_abbr nutrient<chr> <chr>
1 G Glucose
2 N Ammonia
3 P Phosphate
4 S Sulfate
5 L Leucine
6 U Uracil
Use that table with a join
function to add full nutrient names to the tidied data.
Convert the rate
column to numeric and select the final columns in the desired order
Remove any NA values from systematic_name
and exp_level
using filter()
This is a “cheat” chunk to load the tidy data directly if you want to skip ahead. Otherwise, use the code above to create the tidy data from the raw data.
brauer_tidy_tbl <- read_tsv(
"https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_tidy.tsv.gz"
)
yeast_gene_info_tbl <- read_tsv(
"https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/yeast_go_terms.tsv.gz"
)
yeast_protein_props_tbl <- read_tsv(
"https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/yeast_prot_prop.tsv.gz"
)
Task 6: Calculate summary statistics for gene expression by nutrient type.
Breadcrumbs: - Use group_by()
and summarize()
to calculate mean, median, and standard deviation of expression values for each nutrient. Which nutrients show the highest variability in expression?*
# Calculate summary statistics by nutrient
# Find genes with extreme expression values
Inspect the results and note any patterns you observe in high and low expression genes across different nutrient-rate combinations.
Next, make a boxplot to visualize the distribution of expression levels for each nutrient condition. What insights can you draw from the plot?
#|
Task 7: Find genes with extreme expression values under different conditions.
Breadcrumbs: For each nutrient-rate combination, identify the top 5 highest and lowest expressing genes. Use slice_max()
and slice_min()
or ranking functions. What patterns do you notice?
# Find genes with extreme expression values
Task 8: Fit linear models to examine how each gene’s expression responds to growth rate within each nutrient condition.
Breadcrumbs:
reframe()
with broom::tidy(lm(exp_level ~ rate))
for each gene-nutrient combination.by = c(nutrient, systematic_name)
to group the modelingThe above should give you a tibble that looks like this:
# A tibble: 66,430 × 7
nutrient systematic_name term estimate std.error statistic p.value<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Glucose YNL049C (Intercept) -0.263 0.0441 -5.98 0.00394
2 Glucose YNL049C rate 0.714 0.226 3.16 0.0343
3 Ammonia YNL049C (Intercept) 0.153 0.247 0.619 0.569
4 Ammonia YNL049C rate -1.09 1.27 -0.862 0.437
5 Phosphate YNL049C (Intercept) -0.465 0.161 -2.89 0.0444
6 Phosphate YNL049C rate 2.07 0.825 2.51 0.0657
7 Sulfate YNL049C (Intercept) -0.419 0.157 -2.67 0.0558
8 Sulfate YNL049C rate 2.23 0.807 2.77 0.0503
9 Leucine YNL049C (Intercept) 0.193 0.0313 6.16 0.00352
10 Leucine YNL049C rate -0.177 0.161 -1.10 0.332
# ℹ 66,420 more rows
# ℹ Use `print(n = ...)` to see more rows
Task 9: Examine the slope terms to identify genes that significantly respond to growth rate changes.
Breadcrumbs: Filter for slope terms (not intercepts). Use q-value correction for multiple testing. Create histograms of p-values by nutrient. Which genes show the strongest positive or negative relationships with growth rate?
# Analyze slope coefficients
Task 10: Use intercept terms to identify genes with unusual baseline expression under specific nutrient limitations.
Breadcrumbs: Filter for intercept terms. Center intercepts around each gene’s mean across nutrients using group_by()
and mutate()
. Use top_n()
to find genes with extreme centered intercepts.
# Analyze intercept terms
Use this above data to try to ask and answer your own biological question.
State one or more hypotheses you want to test using this data.
Generate some tables or plots to explore your hypothesis.
Interpret your results. What do they mean in the context of your hypothesis? Were you able to support or refute your hypothesis?