Exercises 8


Jay Hesselberth


October 21, 2024

Putting it all together

For the next two classes we’ll combine everything we’ve learned to process and visualize data from some biological experiments. These exercises illustrate a complete analysis pipeline, from data tidying to manipulation and visualization, using tools from the tidyverse.


Load the libraries you need for analysis below.
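
A minimal sketch of a setup chunk, assuming the tidyverse, here, and cowplot packages are installed. These package choices are inferred from the functions used later in these exercises (read_tsv(), here(), pivot_longer(), and a cowplot theme); they are not prescribed by the exercise itself.

```r
# Core tidyverse: readr (read_tsv), tidyr (pivot_longer, separate),
# dplyr (left_join, group_by, summarize), and ggplot2
library(tidyverse)

# here() builds file paths relative to the project root
library(here)

# cowplot provides publication-style ggplot2 themes
library(cowplot)
```

Loading here after the tidyverse means that here() resolves to the path-building function from the here package.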

A quantitative PCR experiment

Here is the experimental setup:

  • Two cell lines (wt and mut) were treated with a drug that induces interferon expression

  • After specific time points, cells were harvested and actin and interferon mRNA were analyzed by quantitative PCR (with 3 technical replicates), with a control containing no reverse transcriptase.

Load the data

These data are in two TSV files:

  • data/qpcr_names.tsv.gz
  • data/qpcr_data.tsv.gz

Load these data sets and inspect.

qpcr_names <- read_tsv(here("data/bootcamp/qpcr_names.tsv.gz"))
qpcr_data <- read_tsv(here("data/bootcamp/qpcr_data.tsv.gz"))

Note the shape of the data and the names of the rows and columns. Do they remind you of anything?

Tidy the data

Given the experimental setup and the shape of the tibbles, you should be able to answer: Are these data tidy?

  • What are the variables in the data?
  • Are the variables the column names?

qpcr_data_long <-
  pivot_longer(qpcr_data, -row, names_to = "col")

qpcr_names_long <-
  pivot_longer(qpcr_names, -row, names_to = "col") |>
  # split each sample name into its component variables
  separate(
    value,
    into = c("gt", "time", "gene", "rt", "rep"),
    sep = "_"
  )
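
To see what the tidying step does, here is a small sketch on a made-up 2 x 2 "plate". The tibble toy_names and the sample-name format (for example wt_0_ACTIN_+_1) are assumptions for illustration only; check the real qpcr_names tibble for the actual layout.

```r
library(tidyr)
library(tibble)

# Hypothetical 2 x 2 plate of sample names (not the real qPCR data)
toy_names <- tribble(
  ~row, ~`1`,              ~`2`,
  "A",  "wt_0_ACTIN_+_1",  "wt_0_IFN_+_1",
  "B",  "mut_0_ACTIN_+_1", "mut_0_IFN_+_1"
)

toy_names_long <-
  # one row per well: the plate column number moves into `col`,
  # and the cell contents land in the default `value` column
  pivot_longer(toy_names, -row, names_to = "col") |>
  # split the sample name on underscores into five variables
  separate(
    value,
    into = c("gt", "time", "gene", "rt", "rep"),
    sep = "_"
  )

toy_names_long
```

Each well becomes one row, and each part of the sample name becomes its own column. Note that separate() leaves every new column as character unless you ask it to convert types.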

Merge the data

Note the structure of the tidied data. What columns (variables) are shared by both tibbles?

How can we join the data from these two tibbles, linking the sample identifiers with their gene expression values?

qpcr_tidy <-
  # join on the plate coordinates shared by both tibbles
  left_join(qpcr_names_long, qpcr_data_long, by = c("row", "col")) |>
  # the -RT samples are all 0, so we can drop those
  filter(rt == "+") |>
  # we don't need row, col, or rt anymore
  select(-(row:col), -rt)

Summarize the data

Calculate the mean and standard deviation across replicates.

Do this two ways:

  1. Calculate the statistics for each gene separately.
  2. Calculate a ratio of interferon to actin levels for each sample before calculating the mean and standard deviation of the ratios.

qpcr_summary <-
  # approach 1: statistics for each gene separately
  group_by(qpcr_tidy, gt, time, gene) |>
  summarize(
    exp_mean = mean(value),
    exp_sd = sd(value)
  ) |>
  arrange(gt, time, gene)
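
The second approach (ratios first, then statistics) could be sketched as follows. The tibble toy_tidy, its values, and the gene labels ACTIN and IFN are made up for illustration, and the sketch assumes that replicate 1 of interferon is paired with replicate 1 of actin, and so on.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Hypothetical tidy data: one genotype, one time point, 3 replicates
toy_tidy <- tribble(
  ~gt,  ~time, ~gene,   ~rep, ~value,
  "wt", "4",   "ACTIN", "1",  10,
  "wt", "4",   "ACTIN", "2",  20,
  "wt", "4",   "ACTIN", "3",  40,
  "wt", "4",   "IFN",   "1",  20,
  "wt", "4",   "IFN",   "2",  60,
  "wt", "4",   "IFN",   "3",  160
)

toy_ratio_summary <-
  toy_tidy |>
  # give each gene its own column so paired replicates share a row
  pivot_wider(names_from = gene, values_from = value) |>
  # per-replicate ratio of interferon to actin
  mutate(ratio = IFN / ACTIN) |>
  group_by(gt, time) |>
  summarize(
    ratio_mean = mean(ratio),
    ratio_sd = sd(ratio),
    .groups = "drop"
  )

toy_ratio_summary
```

Widening the data first is a design choice: it makes the pairing of replicates explicit, so the ratio is a plain column operation inside mutate().
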
Plot the data

Now we can plot the summary statistics. We’ll use ggplot2::geom_pointrange() to represent the mean and standard deviation.

You’ll need to fill in the blanks (___) below.

ggplot(
  qpcr_summary,
  aes(
    x = ___,
    y = ___,
    color = ___
  )
) +
  geom_pointrange(
    aes(ymin = ___, ymax = ___)
    # position = ___
  )

Inspect the above plot. How might you improve it?

Copy the above chunk and add functions that modify the plot’s look and feel.

  • Facet the plot to see differences between the genotypes.
  • Update the theme using cowplot.
  • Update the x, y, and title labels (ggplot2::labs()).
  • Update the colors with a nicer palette (ggplot2::scale_*).
  • Fix the position of the geoms by setting their position argument (for example, ggplot2::position_dodge()).
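
As a rough illustration of the modifications listed above, here is a sketch on made-up summary data. The tibble toy_summary, its values, the axis labels, and the choice of the "Set1" palette are all assumptions; substitute your own summary tibble and preferences.

```r
library(ggplot2)
library(cowplot)
library(tibble)

# Hypothetical summary statistics (not the real qPCR results)
toy_summary <- tribble(
  ~gt,   ~time, ~gene,   ~exp_mean, ~exp_sd,
  "wt",  0,     "ACTIN", 100,       5,
  "wt",  4,     "ACTIN", 105,       6,
  "wt",  0,     "IFN",   10,        2,
  "wt",  4,     "IFN",   80,        9,
  "mut", 0,     "ACTIN", 98,        4,
  "mut", 4,     "ACTIN", 102,       5,
  "mut", 0,     "IFN",   9,         2,
  "mut", 4,     "IFN",   20,        4
)

toy_plot <-
  ggplot(toy_summary, aes(x = factor(time), y = exp_mean, color = gene)) +
  geom_pointrange(
    aes(ymin = exp_mean - exp_sd, ymax = exp_mean + exp_sd),
    # dodge so the two genes do not overlap at the same time point
    position = position_dodge(width = 0.5)
  ) +
  # one panel per genotype
  facet_wrap(~gt) +
  scale_color_brewer(palette = "Set1") +
  theme_cowplot() +
  labs(
    x = "Time point",
    y = "Expression (mean +/- SD)",
    color = "Gene",
    title = "Toy qPCR summary"
  )

toy_plot
```

Treating time as a factor keeps the time points evenly spaced and lets position_dodge() separate the genes within each one.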

Interpret the plot

  • What can you say about the expression of ACTIN and IFN?
  • What can you say about the mutant and wild-type cells?