RNA-sequencing intro
RNA Bioscience Initiative | CU Anschutz
2024-10-21
Messenger RNA (mRNA) carries genetic information encoded in DNA required for making proteins.
Typically refers to “long” RNAs i.e. mRNA and long non-coding RNA (lncRNA). Specifically, we capture the steady-state pool of mature mRNA and to a lesser degree pre-mRNA. Thus, we can easily assess the abundance and isoforms expressed in the sample of interest. Of course, long read RNA sequencing (Nanopore, PacBio) enable better detection of continuity of exons and full-length isoforms.
Need to determine which population of RNA you are interested in sequencing. The vast majority (~80%) of RNA in the cell is from ribosomal RNA. Smaller regulatory non-coding RNA are typically excluded due to size selection (snRNA, snoRNA, tRNA, miRNA).
polyA selection: uses oligo dT to hybridize to poly A tails of mRNA (and many long non-coding RNA)
depletion of rRNA: uses DNA oligos complementary to portions of rRNA to either remove (purification) or degrade (RNaseH) to avoid rRNA getting into the library.
size selection: sequence a population of RNAs that have a specific length such as microRNAs, which are ~21 nt regulatory non-coding RNAs.
…and the genome has complex organization.
strand-specificity is crucial
You get your data back and align the reads to the genome, right? Nope, we need to deal with reads that will need to be “split” - spliced exons - to properly align. There are two strategies to deal with this: 1) Spliced alignments and 2) Pseudoalignment (transcripts).
OR align/quantify in the same step. Fast and accurate…but you need to provide the transcripts (cannot discover new isoforms).
Create an index to evaluate the sequences for all possible unique sequences of length k (k-mer) in the transcriptome from known transcripts (splice isoforms for all genes).
The Salmon index has two components:
The quasi-mapping approach estimates where the reads best map to on the transcriptome through identifying where informative sequences within the read map to instead of performing base-by-base alignment.
After determining the best mapping for each read/fragment, salmon will generate the final transcript abundance estimates after modeling sample-specific parameters and biases. Note that reads/fragments that map equally well to more than one transcript will have the count divided between all of the mappings; thereby not losing information for the various gene isoforms.
Salmon and Kallisto account for:
GC bias
positional coverage biases
sequence biases at 5’ and 3’ ends of the fragments
fragment length distribution
strand-specificity
Need to deal with systematic differences within/between samples such as:
CPM — (read) Counts Per Million: \[CPM = \displaystyle \frac {\#\ reads\ mapped\ to\ gene*10^6}{Total\ \#\ mapped\ reads}\]
[F|R]PKM — [Fragments|Reads] Per Kilobase per Million: \[FPKM = \displaystyle \frac {\#\ fragments\ mapped\ to\ gene*10^6}{Total\ \#\ mapped\ reads*transcript\ length}\]
TPM — Transcripts Per Million: \[TPM = \displaystyle A* \frac {1}{\sum_{}A}\] \[A = \displaystyle \frac {\#\ fragments\ mapped\ to\ gene*10^6}{transcript\ length}\]
RPKM and FPKM are pretty much the same thing. FPKM is for fragments (paired-ends), not reads (single-end).
TPM and [F|R]PKM account for length differences between transcripts.
The sum of all [F|R]PKMs may not be the same across samples.
TPM normalizes to gene length first and then normalize for sequencing depth. Thus the sum of all TPMs is the same across samples. A better measure of relative transcript “concentration” in your sample than [F|R]PKM.
Can I compare TPM / *PKM / CPM across samples?
It depends what you mean by “compare”. Because these measures are purely relative, you cannot reliably use a metric like TPM to directly assess differences in transcript abundance across samples. Specifically, changes in the composition (e.g. polyA vs rRNA-depleted) of a sample can lead to changes in TPM, even if transcripts are, in reality, expressed at the same true level across samples. Metrics like this can be useful for “high-level” comparisons (e.g. visualizing samples etc.). However, whenever using a relative metric like this, one should be aware of its relative nature and the potential caveats that go along with interpreting them between samples.
We do NOT use TPM differences for differential expression.
Misuse of RPKM or TPM normalization when comparing across samples and sequencing protocols
DO NOT USE TPM (or anything we just talked about) to perform differential expression analysis.
RNA-seq data are discrete non-negative integers (counts per transcripts).
Remember the reads are (pseudo-)aligned and we COUNT how many are assigned to a specific transcript in a given sample.
Reads are count based and not normally distributed. Two distributions for count based data are poisson (which presumes the variance and mean are equal) or negative binomial (which does not). This is especially a problem when the number of biological replicates are low because it is hard to accurately model variance of count based data if you are looking at only that gene and making the assumptions of normally distributed continuous data (ie a t-test).
Overdispersion the variance of counts is generally greater than their mean, especially for genes expressed at a higher level.
The total number of reads for each sample tends to be in the millions, while the counts per gene are much lower (many zeros, tens/hundreds) and vary considerably. While the Poisson distribution seems appropriate for sampling out of a large pool with low probability. Poisson does not handle overdispersion, enter the Negative Binomial distribution.
d <- read_csv(here("data", "unfilt_counts.csv.gz")) |> as.matrix()
df <- tibble(
variance = rowVars(d),
mean = rowMeans(d)
)
ggplot(df) +
geom_point(aes(x = mean, y = variance)) +
scale_y_log10(limits = c(1, 1e9)) +
scale_x_log10(limits = c(1, 1e9)) +
geom_abline(intercept = 0, slope = 1, color = "red") +
theme_cowplot()
data points do not fall on the diagonal, mean != var
for highly expressed genes, var > mean
lowly expressed genes have more scatter i.e. “heteroscedasticity”.
Why do we use the negative binomial distribution for analysing RNAseq data?
where counts \(K_{ij}\) for gene i
, sample j
are modeled using a negative binomial distribution with fitted mean \(\mu_{ij}\) and a gene-specific dispersion parameter \(\alpha_i\). The fitted mean is composed of a sample-specific size factor \(s_{j}\) and a parameter \(q_{ij}\) proportional to the expected true concentration of fragments for sample j
.
\[\log_2(q_{ij}) = x_{j.} \beta_i\] The coefficients \(\beta_{i}\) give log2 fold changes for gene i
`for each column of the model matrix X. Note that the model can be generalized to use sample- and gene-dependent normalization factors \(s_{ij}\).
The counts divided by sample-specific size factors determined by median ratio of gene counts relative to geometric mean per gene.
Step 1: creates a pseudo-reference sample (row-wise geometric mean)
Step 2: calculates ratio of each sample to the reference
Step 3: calculate the normalization factor for each sample (size factor)
mock_rna_A mock_rna_B mock_rna_C 8430_rna_A 8430_rna_B 8430_rna_C
0.9248615 0.8867074 1.0318455 1.0693342 1.0238999 1.1038130
We need to generate accurate estimates of within-group variation for each gene…but usually have only 3 replicates making it hard to estimate reliably.
DESeq2 shares information across genes to generate more accurate estimates of variation based on the mean expression level of the gene using a method called ‘shrinkage’. DESeq2 assumes that genes with similar expression levels have similar dispersion.
Estimating the dispersion for each gene separately:
To model the dispersion based on expression level (mean counts of replicates), the dispersion for each gene is estimated using maximum likelihood estimation. In other words, given the count values of the replicates, the most likely estimate of dispersion is calculated.
A study compared mRNA expression profiles of many human and mouse tissues. One of their key findings:
GENE EXPRESSION IS MORE SIMILAR AMONG TISSUES WITHIN A SPECIES THAN BETWEEN CORRESPONDING TISSUES OF THE TWO SPECIES
Replicates allow us to:
Course website: https://rnabioco.github.io/molb-7950