Exercises 8: Gene Expression Analysis with Tidyverse
Advanced Modeling and Functional Analysis
1 Overview
In this problem set, you’ll work with the Brauer gene expression dataset to practice comprehensive tidyverse skills including data tidying, transformation, joins, pivoting, string manipulation, and statistical modeling using broom. The dataset contains gene expression measurements for yeast genes under different nutrient limitations and growth rates.
1.1 Predictions
Before we start tidying and analyzing the data, take a moment to predict what you might find.
- Question: What patterns do you expect to see in gene expression across different nutrients and growth rates?
- Hypothesis: Genes involved in nutrient uptake and metabolism will show higher expression under their respective limiting conditions.
- Hypohtesis: The different nutrient conditions will cause distinct gene expression profiles.
2 Setup and Data Loading
2.1 Load Required Libraries
2.2 Load the Data
Task 1: Load the raw Brauer gene expression data and examine its structure. What makes this data “messy” or untidy?
Breadcrumbs: Use read_tsv() to load the data from the URL. Examine column names and the first few rows. Think about tidy data principles - what issues do you see with the current format?
Code
# Load the Brauer gene expression data
url <- "https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_raw.tsv.gz"
brauer_raw_tbl <- read_tsv(url)Rows: 5537 Columns: 44
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (6): GID, YORF, name, BP, MF, systematic_name
dbl (38): number, GWEIGHT, G0.05, G0.1, G0.15, G0.2, G0.25, G0.3, N0.05, N0....
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
# Examine the structure of the data3 Part 1: Data Tidying (tidyr)
We want to create a tibble that looks like this:
# A tibble: 199,296 × 4
systematic_name nutrient rate exp_level
<chr> <fct> <dbl> <dbl>
1 YPR204W Glucose 0.05 1.17
2 YPR204W Glucose 0.1 1
3 YPR204W Glucose 0.15 0.86
4 YPR204W Glucose 0.2 0.77
5 YPR204W Glucose 0.25 0.53
6 YPR204W Glucose 0.3 0.3
7 YPR204W Ammonia 0.05 2.79
8 YPR204W Ammonia 0.1 2
9 YPR204W Ammonia 0.15 0.6
10 YPR204W Ammonia 0.2 0.16
# ℹ 199,286 more rows
# ℹ Use `print(n = ...)` to see more rowsNote the classes of each column - systematic_name is character, nutrient is a factor, rate is numeric, and exp_level is numeric.
3.1 Create a Tidy Dataset
Task 3: Transform the wide-format expression data into a long format suitable for analysis.
Breadcrumbs:
First select the relevant columns (
systematic_nameand the expression columns (G0.05, etc))Then use
pivot_longer()to convert expression columns to rowsThe column names contain both nutrient type (
G) and growth rate (0.05) information - useseparate_wider_position()to split the first character (nutrient abbreviation) from the numeric rate. The key here is to specifywidths = c(nutrient_abbr = 1, rate = 4)to split after the first character, and put the rest intorate. Not all the rates have 4 characters, so usetoo_few = "align_start"to handle that.-
Create a nutrient lookup table to convert abbreviations to full names. The lookup table should look like this:
# A tibble: 6 × 2 nutrient_abbr nutrient <chr> <chr> 1 G Glucose 2 N Ammonia 3 P Phosphate 4 S Sulfate 5 L Leucine 6 U Uracil Use that table with a
joinfunction to add full nutrient names to the tidied data.Convert the
ratecolumn to numeric and select the final columns in the desired orderRemove any NA values from
systematic_nameandexp_levelusingfilter()
Code
# nutrient abbreviations
nutrient_lookup <- tibble(
nutrient_abbr = c("G", "N", "P", "S", "L", "U"),
nutrient = c(
"Glucose",
"Ammonia",
"Phosphate",
"Sulfate",
"Leucine",
"Uracil"
)
)
# Transform to long format
brauer_tidy_tbl <-
brauer_raw_tbl |>
select(systematic_name, G0.05:U0.3) |>
pivot_longer(
-systematic_name,
names_to = "condition",
values_to = "exp_level"
) |>
separate_wider_position(
condition,
widths = c(nutrient_abbr = 1, rate = 4),
too_few = "align_start"
) |>
mutate(rate = as.numeric(rate)) |>
left_join(nutrient_lookup, by = "nutrient_abbr") |>
select(systematic_name, nutrient, rate, exp_level) |>
filter(!is.na(systematic_name) & !is.na(exp_level))4 Part 2: Expression patterns
This is a “cheat” chunk to load the tidy data directly if you want to skip ahead. Otherwise, use the code above to create the tidy data from the raw data.
Code
brauer_tidy_tbl <- read_tsv(
"https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/brauer_gene_exp_tidy.tsv.gz"
) |>
select(systematic_name, G0.05:U0.3) |>
pivot_longer(
-systematic_name,
names_to = "condition",
values_to = "exp_level"
) |>
separate_wider_position(
condition,
widths = c(nutrient_abbr = 1, rate = 4),
too_few = "align_start"
) |>
mutate(rate = as.numeric(rate)) |>
left_join(nutrient_lookup, by = "nutrient_abbr") |>
select(systematic_name, nutrient, rate, exp_level) |>
filter(!is.na(systematic_name) & !is.na(exp_level))Rows: 5537 Columns: 37
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (1): systematic_name
dbl (36): G0.05, G0.1, G0.15, G0.2, G0.25, G0.3, N0.05, N0.1, N0.15, N0.2, N...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
yeast_gene_info_tbl <- read_tsv(
"https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/yeast_go_terms.tsv.gz"
)Rows: 5537 Columns: 4
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (4): systematic_name, common_name, go_molecular_function, go_biological_...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
yeast_protein_pros_tbl <- read_tsv(
"https://github.com/rnabioco/molb-7950/raw/refs/heads/main/data/bootcamp/yeast_prot_prop.tsv.gz"
)Rows: 6718 Columns: 40
── Column specification ────────────────────────────────────────────────────────
Delimiter: "\t"
chr (17): ORF, N_term_seq, C_term_seq, GRAVY Score, Aromaticity Score, CAI, ...
dbl (23): Mw, PI, Protein Length, Ala, Cys, Asp, Glu, Phe, Gly, His, Ile, Ly...
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
4.1 Explore Expression Patterns
Task 6: Calculate summary statistics for gene expression by nutrient type.
Breadcrumbs: - Use group_by() and summarize() to calculate mean, median, and standard deviation of expression values for each nutrient. Which nutrients show the highest variability in expression?*
Code
# A tibble: 6 × 4
nutrient exp_mean exp_median exp_sd
<chr> <dbl> <dbl> <dbl>
1 Ammonia -0.000671 0.01 0.793
2 Phosphate -0.00207 0.03 0.769
3 Sulfate -0.00381 0.01 0.686
4 Uracil -0.00141 -0.01 0.649
5 Glucose 0.0223 -0.01 0.576
6 Leucine 0.00577 0 0.478
Code
# A tibble: 180 × 4
systematic_name nutrient rate exp_level
<chr> <chr> <dbl> <dbl>
1 YKR039W Ammonia 0.05 6.64
2 YJR152W Ammonia 0.05 6.64
3 YNL142W Ammonia 0.05 6.03
4 YHR096C Ammonia 0.05 5.59
5 YIR028W Ammonia 0.05 5.56
6 YKR039W Ammonia 0.1 6.64
7 YJR152W Ammonia 0.1 6.64
8 YNL142W Ammonia 0.1 6.13
9 YDR345C Ammonia 0.1 5.67
10 YNL117W Ammonia 0.1 5.48
# ℹ 170 more rows
Inspect the results and note any patterns you observe in high and low expression genes across different nutrient-rate combinations.
Next, make a boxplot to visualize the distribution of expression levels for each nutrient condition. What insights can you draw from the plot?
4.2 Identify High and Low Expression
Task 7: Find genes with extreme expression values under different conditions.
Breadcrumbs: For each nutrient-rate combination, identify the top 5 highest and lowest expressing genes. Use slice_max() and slice_min() or ranking functions. What patterns do you notice?
Code
# Find genes with extreme expression values5 Part 3: Statistical Modeling with broom
5.1 Linear Models for Individual Genes
Task 8: Fit linear models to examine how each gene’s expression responds to growth rate within each nutrient condition.
Breadcrumbs:
- Use
reframe()withbroom::tidy(lm(exp_level ~ rate))for each gene-nutrient combination - Use
.by = c(nutrient, systematic_name)to group the modeling - This creates a tidy data frame of model coefficients
Code
Warning: There was 1 warning in `reframe()`.
ℹ In argument: `broom::tidy(lm(exp_level ~ rate))`.
ℹ In group 27769: `nutrient = "Ammonia"`, `systematic_name = "YJR152W"`.
Caused by warning in `summary.lm()`:
! essentially perfect fit: summary may be unreliable
The above should give you a tibble that looks like this:
# A tibble: 66,430 × 7
nutrient systematic_name term estimate std.error statistic p.value
<chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
1 Glucose YNL049C (Intercept) -0.263 0.0441 -5.98 0.00394
2 Glucose YNL049C rate 0.714 0.226 3.16 0.0343
3 Ammonia YNL049C (Intercept) 0.153 0.247 0.619 0.569
4 Ammonia YNL049C rate -1.09 1.27 -0.862 0.437
5 Phosphate YNL049C (Intercept) -0.465 0.161 -2.89 0.0444
6 Phosphate YNL049C rate 2.07 0.825 2.51 0.0657
7 Sulfate YNL049C (Intercept) -0.419 0.157 -2.67 0.0558
8 Sulfate YNL049C rate 2.23 0.807 2.77 0.0503
9 Leucine YNL049C (Intercept) 0.193 0.0313 6.16 0.00352
10 Leucine YNL049C rate -0.177 0.161 -1.10 0.332
# ℹ 66,420 more rows
# ℹ Use `print(n = ...)` to see more rows5.2 Analyze Slope Coefficients
Task 9: Examine the slope terms to identify genes that significantly respond to growth rate changes.
Breadcrumbs: Filter for slope terms (not intercepts). Use q-value correction for multiple testing. Create histograms of p-values by nutrient. Which genes show the strongest positive or negative relationships with growth rate?
Code
# Analyze slope coefficients5.3 Analyze Intercept Terms
Task 10: Use intercept terms to identify genes with unusual baseline expression under specific nutrient limitations.
Breadcrumbs: Filter for intercept terms. Center intercepts around each gene’s mean across nutrients using group_by() and mutate(). Use top_n() to find genes with extreme centered intercepts.
Code
# Analyze intercept terms6 Your assignment
Use this above data to try to ask and answer your own biological question.
6.1 Hypothesis
State one or more hypotheses you want to test using this data.
6.2 Explore / Analyze
Generate some tables or plots to explore your hypothesis.
6.3 Interpreation
Interpret your results. What do they mean in the context of your hypothesis? Were you able to support or refute your hypothesis?
