RNAseq DE

Matthew Taliaferro

RNA Bioscience Initiative | CU Anschutz

2024-10-21

Overview

Last time we took RNAseq data from an in vitro differentiation timecourse from mouse ESCs to glutaminergic neurons (Hubbard et al, F1000 Research (2013)). We took in transcript-level quantifications produced by salmon and collapsed them to gene-level quantifications using tximport. We then inspected the quality of the data by relating distances between samples using two methods: hierarchical clustering and principal components analysis. We found that the data was of high quality, as evidenced by the fact that replicates from a given timepoint were highly similar to other replicates for the same timepoint, and the distances between samples made sense with what we know about how the experiment was conducted.

Today, we are going to pretend that this isn’t a timecourse. For the sake of simplicity, we are going to imagine that we have only two conditions: DIV0 and DIV7. We will use DESeq2 to identify genes that are differentially expressed between these two timepoints. We will then plot changes in expression for both individual genes and groups of genes that we already know going in might be interesting to look at. Finally, we will look at some features of transcripts and genes that are differentially expressed between these two timepoints.

Prepare `t2g`

The first thing we need to do is read in the data again and move from transcript-level expression values to gene-level expression values with tximport. Let’s use biomaRt to get a table that relates gene and transcript IDs.

mart <- biomaRt::useMart(
  "ENSEMBL_MART_ENSEMBL",
  dataset = "mmusculus_gene_ensembl"
)

t2g <- biomaRt::getBM(
  attributes = c(
    "ensembl_transcript_id",
    "ensembl_gene_id",
    "external_gene_name"
  ),
  mart = mart
) |>
  as_tibble()

# maps systematic to common gene names
gene_name_map <- t2g |>
  dplyr::select(-ensembl_transcript_id) |>
  unique()

Prepare `metdata` and import

Now we can read in the transcript-level data and collapse to gene-level data with tximport

metadata <- data.frame(
  sample_id = list.files(here("data/block-rna/differentiation_salmonouts"),
    pattern = "^DIV"
  ),
  salmon_dirs = list.files(here("data/block-rna/differentiation_salmonouts"),
    recursive = T,
    pattern = "quant.sf",
    full.names = T
  )
) |>
  separate(col = sample_id, into = c("timepoint", "rep"), sep = "\\.", remove = F)

metadata$rep <- gsub(pattern = "Rep", replacement = "", metadata$rep)

rownames(metadata) <- metadata$sample_id

metadata <- metadata |>
  filter(timepoint %in% c("DIV0", "DIV7"))

salmdir <- metadata$salmon_dirs
names(salmdir) <- metadata$sample_id

txi <- tximport(
  files = salmdir,
  type = "salmon",
  tx2gene = t2g,
  dropInfReps = TRUE,
  countsFromAbundance = "lengthScaledTPM"
)

Filter lowly expressed genes

# examine distribution of TPMs
hist(log2(1 + rowSums(txi$abundance)), breaks = 40)

# decide a cutoff
keepG <- txi$abundance[log2(1 + rowSums(txi$abundance)) > 4.5, ] |>
  rownames()

Create DESeq object

There are essentially two steps to using DESeq2. The first involves creating a DESeqDataSet from your data. Luckily, if you have a tximport object, which we do in the form of txi, then this becomes easy.

# metadata$timepoint <- as.factor(metadata$timepoint)

ddsTxi <- DESeqDataSetFromTximport(
  txi,
  colData = metadata,
  design = ~timepoint
)

# keep genes with sufficient expession
ddsTxi <- ddsTxi[keepG, ]

Design formula

You can see that DESeqDataSetFromTximport wants three things. The first is our tximport object. The second is the dataframe we made that relates samples and conditions (or in this case timepoints). The last is something called a design formula. A design formula contains all of the variables that will go into DESeq2’s model. The formula starts with a tilde and then has variables separated by a plus sign think lm(). It is common practice, and in fact basically required with DESeq2, to put the variable of interest last. In our case, that’s trivial because we only have one: timepoint. So our design formula is very simple:

design = ~ timepoint

Your design formula should ideally include all of the sources of variation in your data. For example, let’s say that here we thought there was a batch effect with the replicates. Maybe all of the Rep1 samples were prepped and sequenced on a different day than the Rep2 samples and so on. We could potentially account for this in DESeq2’s model with the following forumula:

design = ~ rep + timepoint

Here, timepoint is still the variable of interest, but we are controlling for differences that arise due to differences in replicates.

Run `DESeq2`

We can see here that DESeq2 is taking the counts produced by tximport for gene quantifications. There are 52346 genes (rows) here and 7 samples (columns). Now using this ddsTxi object, we can run DESeq2.

# create DESeq object
dds <- DESeq(ddsTxi)

There are many useful things in this dds object. Take a look at the vignette for a full explanation. Including info on many more tests and analyses that can be done with DESeq2.

The results can be accessed using the results() function. We will use the contrast argument here. DESeq2 reports changes in RNA abundance between two samples as a log2FoldChange. But, it’s often not clear what the numerator and denominator of that fold change ratio…it could be either DIV7/DIV0 or DIV0/DIV7.

The lexographically first condition will be the numerator. I find it easier to explicitly specify what the numerator and denominator of this ratio are using the contrast argument. The contrast argument can be used to implement more complicated design formula. Remember our design formula that accounted for potential differences due to Replicate batch effects:

~ replicate + timepoint

DESeq2 will account for differences between replicates here to find differences between timepoints.

Contrasts to get results

# For contrast, we give three strings: the factor we are interested in, the numerator, and the denominator
results(dds, contrast = c("timepoint", "DIV7", "DIV0"))

log2 fold change (MLE): timepoint DIV7 vs DIV0 
Wald test p-value: timepoint DIV7 vs DIV0 
DataFrame with 13194 rows and 6 columns
                     baseMean log2FoldChange     lfcSE      stat       pvalue
                    <numeric>      <numeric> <numeric> <numeric>    <numeric>
ENSMUSG00000000001   5616.554      -0.950768 0.0533389 -17.82505  4.51690e-71
ENSMUSG00000000028    723.204      -2.551331 0.1076581 -23.69845 3.74081e-124
ENSMUSG00000000031   3652.438       4.121194 0.2920793  14.10984  3.30305e-45
ENSMUSG00000000037    244.247      -1.272386 0.2050176  -6.20623  5.42706e-10
ENSMUSG00000000056   2391.225       0.789381 0.1470727   5.36728  7.99320e-08
...                       ...            ...       ...       ...          ...
ENSMUSG00000121879  765.69048       0.750408 0.1201152   6.24740  4.17334e-10
ENSMUSG00000121880  169.23936      -0.505444 0.1546037  -3.26929  1.07820e-03
ENSMUSG00000121881   91.49396       1.063596 0.2658573   4.00063  6.31745e-05
ENSMUSG00002076020    2.70267       1.186826 1.1752749   1.00983  3.12578e-01
ENSMUSG00002076083 1281.86626       0.893211 0.0600879  14.86505  5.55712e-50
                           padj
                      <numeric>
ENSMUSG00000000001  2.14358e-70
ENSMUSG00000000028 3.13948e-123
ENSMUSG00000000031  1.10043e-44
ENSMUSG00000000037  8.92425e-10
ENSMUSG00000000056  1.22664e-07
...                         ...
ENSMUSG00000121879  6.89357e-10
ENSMUSG00000121880  1.40713e-03
ENSMUSG00000121881  8.74842e-05
ENSMUSG00002076020  3.39718e-01
ENSMUSG00002076083  1.97668e-49

The columns we are most interested in are log2FoldChange and padj.

log2FoldChange is self-explanatory. padj is the Benjamini-Hochberg corrected pvalue for a test asking if the expression of this gene is different between the two conditions.

Cleanup results

Let’s do a little work on this data frame to make it slightly cleaner and more informative.

diff <-
  results(
    dds,
    contrast = c("timepoint", "DIV7", "DIV0")
  ) |>
  # Change this into a dataframe
  as.data.frame() |>
  # Move ensembl gene IDs into their own column
  rownames_to_column(var = "ensembl_gene_id") |>
  # drop unused columns
  dplyr::select(-c(baseMean, lfcSE, stat, pvalue)) |>
  # Merge this with a table relating ensembl_gene_id with gene short names
  inner_join(gene_name_map) |>
  # Rename external_gene_name column
  dplyr::rename(gene = external_gene_name) |>
  as_tibble()

How many are significant

OK now we have a table of gene expression results. How many genes are significantly up/down regulated between these two timepoints? We will use 0.01 as an FDR (p.adj) cutoff.

# number of upregulated genes
nrow(filter(diff, padj < 0.01 & log2FoldChange > 0))

[1] 5295

# number of downregulated genes
nrow(filter(diff, padj < 0.01 & log2FoldChange < 0))

[1] 5327

Volcano plot of differential expression results

Let’s make a volcano plot of these results.

# meets the FDR cutoff
diff_sig <-
  mutate(
    diff,
    sig = case_when(
      padj < 0.01 ~ "yes",
      .default = "no"
    )
  ) |>
  # if a gene did not meet expression cutoffs that DESeq2 automatically does, it gets a pvalue of NA
  drop_na()

ggplot(
  diff_sig,
  aes(
    x = log2FoldChange,
    y = -log10(padj),
    color = sig
  )
) +
  geom_point(alpha = 0.2) +
  labs(
    x = "DIV7 expression / DIV0 expression, log2",
    y = "-log10(FDR)"
  ) +
  scale_color_manual(
    values = c("black", "red"),
    labels = c("NS", "FDR < 0.01"),
    name = ""
  ) +
  theme_cowplot()

Volcano plot of differential expression results

Change the LFC threshold

In addition to an FDR cutoff, let’s also apply a log2FoldChange cutoff. This will of course be more conservative, but will probably give you a more confident set of genes.

# Is the expression of the gene at least 3-fold different?
diff_lfc <-
  results(dds,
    contrast = c("timepoint", "DIV7", "DIV0"),
    lfcThreshold = log(3, 2)
  ) |>
  # Change this into a dataframe
  as.data.frame() |>
  # Move ensembl gene IDs into their own column
  rownames_to_column(var = "ensembl_gene_id") |>
  # drop unused columns
  dplyr::select(-c(baseMean, lfcSE, stat, pvalue)) |>
  # Merge this with a table relating ensembl_gene_id with gene short names
  inner_join(gene_name_map) |>
  # Rename external_gene_name column
  dplyr::rename(gene = external_gene_name) |>
  as_tibble()

# number of upregulated genes
nrow(
  filter(
    diff_lfc, padj < 0.01 & log2FoldChange > 0
  )
)

[1] 1508

# number of downregulated genes
nrow(
  filter(
    diff_lfc, padj < 0.01 & log2FoldChange < 0
  )
)

[1] 990

Change the LFC threshold

diff_lfc_sig <-
  mutate(
    diff_lfc,
    sig = case_when(
      padj < 0.01 ~ "yes",
      .default = "no"
    )
  ) |>
  drop_na()


# look at some specific genes

diff_lfc_sig |>
  filter(gene %in%
    c("Bdnf", "Dlg4", "Klf4", "Sox2")) |>
  gt()

ensembl_gene_id	log2FoldChange	padj	gene	sig
ENSMUSG00000003032	-2.344004	4.526178e-09	Klf4	yes
ENSMUSG00000020886	3.339432	1.903142e-45	Dlg4	yes
ENSMUSG00000048482	1.710541	4.864695e-01	Bdnf	no
ENSMUSG00000074637	-2.169444	5.182643e-13	Sox2	yes

Filtered Volcano

ggplot(
  diff_lfc_sig,
  aes(
    x = log2FoldChange,
    y = -log10(padj),
    color = sig
  )
) +
  geom_point(alpha = 0.2) +
  labs(
    x = "DIV7 expression / DIV0 expression, log2",
    y = "-log10(FDR)"
  ) +
  scale_color_manual(
    values = c("black", "red"),
    labels = c("NS", "FDR < 0.01"),
    name = ""
  ) +
  theme_cowplot()

Filtered Volcano

Plotting the expression of single genes

Sometimes we will have particular marker genes that we might want to highlight to give confidence that the experiment worked as expected. We can plot the expression of these genes in each replicate. Let’s plot the expression of two pluripotency genes (which we expect to decrease) and two neuronal genes (which we expect to increase).

So what is the value that we would plot? We could use the ‘normalized counts’ value provided by DESeq2. However, remember there is not length calculation so it is difficult to compare accross genes.

A more interpretable value to plot might be TPM, since TPM is length-normalized. Let’s say a gene was expressed at 500 TPM. Right off the bat, I know generally what kind of expression that reflects (pretty high).

Get TPMs

Let’s plot the expression of Klf4, Sox2, Bdnf, and Dlg4 in our samples.

tpms <- txi$abundance |>
  as.data.frame() |>
  rownames_to_column(var = "ensembl_gene_id") |>
  inner_join(gene_name_map) |>
  dplyr::rename(gene = external_gene_name) |>
  # Filter for genes we are interested in
  filter(gene %in% c("Klf4", "Sox2", "Bdnf", "Dlg4")) |>
  pivot_longer(-c(ensembl_gene_id, gene)) |>
  separate(col = name, into = c("condition", "rep"), sep = "\\.")

gt(tpms)

Get TPMs

ensembl_gene_id	gene	condition	rep	value
ENSMUSG00000003032	Klf4	DIV0	Rep1	40.022629
ENSMUSG00000003032	Klf4	DIV0	Rep2	46.241797
ENSMUSG00000003032	Klf4	DIV0	Rep3	57.374566
ENSMUSG00000003032	Klf4	DIV7	Rep1	9.063035
ENSMUSG00000003032	Klf4	DIV7	Rep2	9.208364
ENSMUSG00000003032	Klf4	DIV7	Rep3	9.715119
ENSMUSG00000003032	Klf4	DIV7	Rep4	8.817037
ENSMUSG00000020886	Dlg4	DIV0	Rep1	18.752921
ENSMUSG00000020886	Dlg4	DIV0	Rep2	15.749749
ENSMUSG00000020886	Dlg4	DIV0	Rep3	12.212636
ENSMUSG00000020886	Dlg4	DIV7	Rep1	151.558403
ENSMUSG00000020886	Dlg4	DIV7	Rep2	156.043098
ENSMUSG00000020886	Dlg4	DIV7	Rep3	153.269475
ENSMUSG00000020886	Dlg4	DIV7	Rep4	153.538436
ENSMUSG00000048482	Bdnf	DIV0	Rep1	2.456452
ENSMUSG00000048482	Bdnf	DIV0	Rep2	2.401157
ENSMUSG00000048482	Bdnf	DIV0	Rep3	2.351342
ENSMUSG00000048482	Bdnf	DIV7	Rep1	7.799981
ENSMUSG00000048482	Bdnf	DIV7	Rep2	7.762255
ENSMUSG00000048482	Bdnf	DIV7	Rep3	7.638515
ENSMUSG00000048482	Bdnf	DIV7	Rep4	7.433529
ENSMUSG00000074637	Sox2	DIV0	Rep1	131.032132
ENSMUSG00000074637	Sox2	DIV0	Rep2	120.433905
ENSMUSG00000074637	Sox2	DIV0	Rep3	110.664468
ENSMUSG00000074637	Sox2	DIV7	Rep1	24.484942
ENSMUSG00000074637	Sox2	DIV7	Rep2	26.080394
ENSMUSG00000074637	Sox2	DIV7	Rep3	27.106054
ENSMUSG00000074637	Sox2	DIV7	Rep4	26.995461

Now plot

ggplot(
  tpms,
  aes(
    x = condition,
    y = value,
    color = condition
  )
) +
  geom_jitter(size = 2, width = .25) +
  labs(
    x = "",
    y = "TPM"
  ) +
  theme_cowplot() +
  scale_color_manual(values = c("blue", "red")) +
  facet_wrap(~gene, scales = "free_y")

How about pathways?

Say that instead of plotting individual genes we wanted to ask whether a whole class of genes are going up or down. We can do that by retrieving all genes that belong to a particular gene ontology term.

There are three classes of genes we will look at here:

Maintenance of pluripotency (GO:0019827)
Positive regulation of the cell cycle (GO:0045787)
Neuronal differentitaion (GO:0030182)

Retrieve pathway information

We can use biomaRt to get all genes that belong to each of these categories. Think of it like doing a gene ontology enrichment analysis in reverse.

pluripotencygenes <- getBM(
  attributes = c("ensembl_gene_id"),
  filters = c("go_parent_term"),
  values = c("GO:0019827"),
  mart = mart
)

cellcyclegenes <- getBM(
  attributes = c("ensembl_gene_id"),
  filters = c("go_parent_term"),
  values = c("GO:0045787"),
  mart = mart
)

neurongenes <- getBM(
  attributes = c("ensembl_gene_id"),
  filters = c("go_parent_term"),
  values = c("GO:0030182"),
  mart = mart
)


# pathway <- bind_rows(pluripotencygenes,
#           cellcyclegenes,
#           neurongenes
#           )
#
# pathway$path <- c(
#   rep("pluri",nrow(pluripotencygenes)),
#   rep("cellcycle",nrow(cellcyclegenes)),
#   rep("neuron",nrow(neurongenes))
#                   )
#
# write_csv(x = pathway, file = here("data","block-rna","pathwaygenes.csv.gz"))

Add pathway information to results

You can see that these items are one-column dataframes that have the column name ‘ensembl_gene_id’. We can now go through our results dataframe and add an annotation column that marks whether the gene is in any of these categories.

diff_paths <-
  diff_lfc |>
  mutate(
    annot = case_when(
      ensembl_gene_id %in% pluripotencygenes$ensembl_gene_id ~ "pluripotency",
      ensembl_gene_id %in% cellcyclegenes$ensembl_gene_id ~ "cellcycle",
      ensembl_gene_id %in% neurongenes$ensembl_gene_id ~ "neurondiff",
      .default = "none"
    )
  ) |>
  drop_na() # drop na

# Reorder these for plotting purposes
diff_paths$annot <-
  factor(
    diff_paths$annot,
    levels = c(
      "none", "cellcycle",
      "pluripotency", "neurondiff"
    )
  )

Are there significant differences?

OK we’ve got our table, now we are going to ask if the log2FoldChange values for the genes in each of these classes are different that what we would expect. So what is the expected value? Well, we have a distribution of log2 fold changes for all the genes that are not in any of these categories. So we will ask if the distribution of log2 fold changes for each gene category is different than that null distribution.

pvals <- rstatix::wilcox_test(data = diff_paths, log2FoldChange ~ annot, ref.group = "none")


p.pluripotency <- pvals |>
  filter(group2 == "pluripotency") |>
  pull(p.adj)

p.cellcycle <- pvals |>
  filter(group2 == "cellcycle") |>
  pull(p.adj)


p.neurondiff <- pvals |>
  filter(group2 == "neurondiff") |>
  pull(p.adj)

plot pathway differences

ggplot(
  diff_paths,
  aes(
    x = annot,
    y = log2FoldChange,
    fill = annot
  )
) +
  labs(
    x = "Gene class",
    y = "DIV7/DIV0, log2"
  ) +
  geom_hline(
    yintercept = 0,
    color = "gray",
    linetype = "dashed"
  ) +
  geom_boxplot(
    notch = TRUE,
    outlier.shape = NA
  ) +
  theme_cowplot() +
  scale_fill_manual(values = c("gray", "red", "blue", "purple"), guide = F) +
  scale_x_discrete(
    labels = c(
      "none", "Cell cycle", "Pluripotency", "Neuron\ndifferentiation"
    )
  ) +
  ylim(-5, 7) +
  # hacky significance bars
  annotate("segment", x = 1, xend = 2, y = 4, yend = 4) +
  annotate("segment", x = 1, xend = 3, y = 5, yend = 5) +
  annotate("segment", x = 1, xend = 4, y = 6, yend = 6) +
  annotate("text", x = 1.5, y = 4.4, label = paste0("p = ", p.cellcycle)) +
  annotate("text", x = 2, y = 5.4, label = paste0("p = ", p.pluripotency)) +
  annotate("text", x = 2.5, y = 6.4, label = paste0("p = ", p.neurondiff))

plot pathway differences

What if we want to look at pathways in an unbiased way?

We will use Gene Set Enrichment Analysis (GSEA) to determine if pre-defined gene sets (pathways, GO terms, experimentally defined genes) are coordinately up-regulated or down-regulated between the two conditions you are comparing. To run gsea you need 2 things. 1. You list of expressed genes ranked by fold change. 2. Pre-defined gene sets. See MSigDb

PMID: 12808457, 16199517

GSEA examples

Top = upregulated

Bottom = downregulated

Prep GSEA

We need to make a list of all genes and their LFC.
We need to find interesting gene sets.
Run GSEA

# retrieve hallmark gene set from msigdb
mouse_hallmark <- msigdbr(species = "Mus musculus") %>%
  filter(gs_cat == "H") %>%
  dplyr::select(gs_name, gene_symbol)

# create a list of gene LFCs
rankedgenes <- diff_lfc %>% pull(log2FoldChange)

# add symbols as names of the list
names(rankedgenes) <- diff$gene

# sort by LFC
rankedgenes <- sort(rankedgenes, decreasing = TRUE)

# deduplicate
rankedgenes <- rankedgenes[!duplicated(names(rankedgenes))]

Prep GSEA

Run GSEA

# run gsea
div7vs0 <- GSEA(
  geneList = rankedgenes,
  eps = 0,
  pAdjustMethod = "fdr",
  pvalueCutoff = .05,
  minGSSize = 20,
  maxGSSize = 1000,
  TERM2GENE = mouse_hallmark
)

div7vs0@result |>
  select(ID, NES, p.adjust) |>
  gt()

Run GSEA

ID	NES	p.adjust
HALLMARK_E2F_TARGETS	-2.990087	2.701379e-31
HALLMARK_G2M_CHECKPOINT	-2.919748	4.509204e-28
HALLMARK_MYC_TARGETS_V1	-2.208258	5.897138e-11
HALLMARK_MITOTIC_SPINDLE	-1.992387	6.402722e-08
HALLMARK_EPITHELIAL_MESENCHYMAL_TRANSITION	-1.970254	1.490220e-06
HALLMARK_COAGULATION	-2.109912	1.301102e-04
HALLMARK_IL6_JAK_STAT3_SIGNALING	-2.087682	2.650917e-04
HALLMARK_IL2_STAT5_SIGNALING	-1.735749	2.650917e-04
HALLMARK_TNFA_SIGNALING_VIA_NFKB	-1.678455	9.835180e-04
HALLMARK_ANDROGEN_RESPONSE	-1.730713	1.807134e-03
HALLMARK_APOPTOSIS	-1.678930	1.958207e-03
HALLMARK_MYC_TARGETS_V2	-1.790632	4.393568e-03
HALLMARK_TGF_BETA_SIGNALING	-1.795252	6.244282e-03
HALLMARK_ESTROGEN_RESPONSE_LATE	-1.530036	6.244282e-03
HALLMARK_XENOBIOTIC_METABOLISM	-1.512733	8.448153e-03
HALLMARK_WNT_BETA_CATENIN_SIGNALING	-1.728399	1.193390e-02
HALLMARK_INTERFERON_ALPHA_RESPONSE	-1.708185	1.193390e-02
HALLMARK_ESTROGEN_RESPONSE_EARLY	-1.449674	1.355045e-02
HALLMARK_ALLOGRAFT_REJECTION	-1.478140	2.127426e-02
HALLMARK_INFLAMMATORY_RESPONSE	-1.450568	2.711999e-02
HALLMARK_NOTCH_SIGNALING	-1.598029	3.518901e-02
HALLMARK_INTERFERON_GAMMA_RESPONSE	-1.417723	3.808103e-02
HALLMARK_COMPLEMENT	-1.357838	4.117068e-02

Plot GSEA

# plot "HALLMARK_G2M_CHECKPOINT"
gseaplot(x = div7vs0, geneSetID = "HALLMARK_G2M_CHECKPOINT")

RNAseq DE

Overview

Prepare t2g

Prepare metdata and import

Filter lowly expressed genes

Create DESeq object

Design formula

Run DESeq2

Contrasts to get results

Cleanup results

How many are significant

Volcano plot of differential expression results

Volcano plot of differential expression results

Change the LFC threshold

Change the LFC threshold

Filtered Volcano

Filtered Volcano

Plotting the expression of single genes

Get TPMs

Get TPMs

Now plot

How about pathways?

Retrieve pathway information

Add pathway information to results

Are there significant differences?

plot pathway differences

plot pathway differences

What if we want to look at pathways in an unbiased way?

GSEA examples

Prep GSEA

Prep GSEA

Run GSEA

Run GSEA

Plot GSEA

Prepare `t2g`

Prepare `metdata` and import

Run `DESeq2`