RNA-seq analysis in R
- Introduce matrices and matrix manipulation in R
- Build a DESeqDataSet and calculate differential gene expression
- Plot results and QC metrics
Matrices in R are two dimensional objects that contain elements of one atomic type
Matrices can contain characters or logical values, but we’re really only interested in numeric matrices:
# Build 4x3 matrix with row-wise series of integers
M <- matrix(1:12,
nrow = 4,
byrow = TRUE)
M
#> [,1] [,2] [,3]
#> [1,] 1 2 3
#> [2,] 4 5 6
#> [3,] 7 8 9
#> [4,] 10 11 12
Elements of a matrix can be accessed using column and row indices or names
To return the value in the first row, second column:
M[1,2]
#> [1] 2
With named rows and columns:
# Assign column names
colnames(M) <- c("col1",
"col2",
"col3")
# Assign row names
row.names(M) <- c("row1",
"row2",
"row3",
"row4")
M["row1","col2"]
#> [1] 2
Index ranges and character vectors of row/column names also work:
M[1:2,2:3]
#> col2 col3
#> row1 2 3
#> row2 5 6
M[c("row1","row2"),c("col2","col3")]
#> col2 col3
#> row1 2 3
#> row2 5 6
As does negative selection:
# Exclude rows 1 and 4
M[-c(1,4),]
#> col1 col2 col3
#> row2 4 5 6
#> row3 7 8 9
Full rows and columns can be extracted as well:
# print 1st row
M[1,]
#> col1 col2 col3
#> 1 2 3
# print 2nd column
M[,2]
#> row1 row2 row3 row4
#> 2 5 8 11
# print 2nd and 3rd rows
M[2:3,]
#> col1 col2 col3
#> row2 4 5 6
#> row3 7 8 9
Mathematical operations can be performed on numeric matrices
For example, taking the base 2 logarithm of each value:
M_log <- log2(M)
M_log
#> col1 col2 col3
#> row1 0.000000 1.000000 1.584963
#> row2 2.000000 2.321928 2.584963
#> row3 2.807355 3.000000 3.169925
#> row4 3.321928 3.459432 3.584963
Or normalizing all columns relative to an arbitrary reference column:
M_norm <- M/M[,1]
M_norm
#> col1 col2 col3
#> row1 1 2.000000 3.000000
#> row2 1 1.250000 1.500000
#> row3 1 1.142857 1.285714
#> row4 1 1.100000 1.200000
Matrix Exercises
- Build a 100x100 row-wise matrix of sequential integers named
mat
. - Create a new matrix,
mat_sub
, with columns 2-10 and rows 40-60 from themat
. - Calculate the maximum and minimum values present in
mat_sub
. - Using the
rowMeans()
function, subtract the row-wise mean from each value inmat_sub
.
Analyzing RNA-Seq Data
Simplified RNA sequencing analysis workflow (genome alignment)
Step | Tools | Description |
---|---|---|
Align reads to genome | STAR, HISAT2 | Align reads to genomic sequence, retaining strand information if present |
Assign reads to features | Subread, StringTie | Assign aligned reads to genes and/or transcripts |
Calculate differential expression | DESeq2, edgeR, Ballgown | Fit count matrices to model to determine differential expression |
Simplified RNA sequencing analysis workflow (quasi-mapping)
Step | Tools | Description |
---|---|---|
Estimate transcript abundance | Salmon / Kallisto | Estimate transcript abundance directly on transcriptome sequences |
Calculate differential expression | DESeq2, edgeR, Ballgown | Fit count matrices to model to determine differential expression |
Inspect RNA-seq count and phenotype data from file that is included in pbda package
# View first few rows of count matrix
head(drug_resistant_counts)
#> parent1 parent2 parent3 A_resistant1 A_resistant2 A_resistant3
#> DDX11L1 0 0 0 0 0 0
#> WASH7P 3 7 13 7 12 9
#> MIR6859-1 1 1 1 2 1 2
#> MIR1302-2 0 0 0 0 0 0
#> FAM138A 0 0 0 0 0 0
#> OR4F5 0 0 0 0 0 0
#> B_resistant1 B_resistant2 B_resistant3 C_resistant1
#> DDX11L1 0 0 0 0
#> WASH7P 10 9 11 24
#> MIR6859-1 2 2 2 6
#> MIR1302-2 0 0 0 0
#> FAM138A 0 0 0 0
#> OR4F5 0 0 0 0
#> C_resistant2 C_resistant3
#> DDX11L1 0 0
#> WASH7P 7 9
#> MIR6859-1 0 1
#> MIR1302-2 0 0
#> FAM138A 0 0
#> OR4F5 0 0
# View phenotype data
drug_resistant_coldata
#> drug
#> parent1 none
#> parent2 none
#> parent3 none
#> A_resistant1 A
#> A_resistant2 A
#> A_resistant3 A
#> B_resistant1 B
#> B_resistant2 B
#> B_resistant3 B
#> C_resistant1 C
#> C_resistant2 C
#> C_resistant3 C
DESeq2 uses DESeqDataSets, a subset of the Bioconductor SummarizedExperiment object, to store data
SummarizedExperiment objects provide a structured, self-contained way to store phenotype, gene and assay data, along with analysis intermediates and parameters.
knitr::include_graphics("img/class-4-se.png")
Build a DESeqDataSet from the drug_resistant data and inspect the object
# Create DESeqDataSet from drug_resistant data
dds <- DESeqDataSetFromMatrix(countData = drug_resistant_counts,
colData = drug_resistant_coldata,
design = ~drug)
# View DESeqDataSet
dds
#> class: DESeqDataSet
#> dim: 33121 12
#> metadata(1): version
#> assays(1): counts
#> rownames(33121): DDX11L1 WASH7P ... VAMP7-2 IL9R-2
#> rowData names(0):
#> colnames(12): parent1 parent2 ... C_resistant2 C_resistant3
#> colData names(1): drug
# View design
dds@design
#> ~drug
# View colData
colData(dds)
#> DataFrame with 12 rows and 1 column
#> drug
#> <factor>
#> parent1 none
#> parent2 none
#> parent3 none
#> A_resistant1 A
#> A_resistant2 A
#> ... ...
#> B_resistant2 B
#> B_resistant3 B
#> C_resistant1 C
#> C_resistant2 C
#> C_resistant3 C
# View counts
assay(dds) %>% head()
#> parent1 parent2 parent3 A_resistant1 A_resistant2 A_resistant3
#> DDX11L1 0 0 0 0 0 0
#> WASH7P 3 7 13 7 12 9
#> MIR6859-1 1 1 1 2 1 2
#> MIR1302-2 0 0 0 0 0 0
#> FAM138A 0 0 0 0 0 0
#> OR4F5 0 0 0 0 0 0
#> B_resistant1 B_resistant2 B_resistant3 C_resistant1
#> DDX11L1 0 0 0 0
#> WASH7P 10 9 11 24
#> MIR6859-1 2 2 2 6
#> MIR1302-2 0 0 0 0
#> FAM138A 0 0 0 0
#> OR4F5 0 0 0 0
#> C_resistant2 C_resistant3
#> DDX11L1 0 0
#> WASH7P 7 9
#> MIR6859-1 0 1
#> MIR1302-2 0 0
#> FAM138A 0 0
#> OR4F5 0 0
Filter genes with little or no expression
DESeq2 performs internal filtering to remove genes with low expression, but explicitly filtering extreme cases beforehand can speed up processing.
Stabilize the variance across the mean for visualizing count data
Many analysis methods for multidimensional data rely on variance being consistent across mean values. Raw count data has the greatest variance at high mean values (largest raw values), while log2-transformed data has the greatest variance at low mean values (largest fold-changes). To get around this, DESeq2 provides two different functions - vst()
and rlog()
- that stabilize variance across the mean.
# rlog transform
rld <- rlog(dds,
blind = FALSE)
Sample distances are useful to assess the overall similarity between samples
Sample distance plots are used to assess overall similarity between samples: which samples are similar to each other, which are different? We calculate the euclidean distance between samples using dist()
on the rlog-transformed data to ensure roughly equal contributions from all genes.
# Transpose count matrix and calculate distances using dist()
sampleDists <- assay(rld) %>% t() %>% dist()
# Convert distance dataframe to matrix
sampleDistMatrix <- as.matrix(sampleDists)
# Choose continuous palette from RColorBrewer
colors <- colorRampPalette(rev(brewer.pal(9, "Blues")))(255)
# Plot sample distance heatmap with ComplexHeatmap
Heatmap(sampleDistMatrix, cluster_rows = F, cluster_columns = F,
col = colors)
Principle Component Analysis (PCA) plots
Another method for visualizing relationships in multidimensional data is principle component analysis. PCA identifies clusters of variables (in this case, genes) that explain the majority of the variance present. PCA plots are x-y plots of the first two principle components; they spread the samples out in the two directions that explain most of the differences.
# Generate PCA plot with DESeq2 plotPCA() wrapper
plotPCA(rld, intgroup = "drug") +
scale_color_brewer(palette = "Set1")
Calculating differential expression
DESeq2 provides a wrapper function DESeq()
that takes a DESeqDataSet and calculates differential expression. Progress through analysis steps are printed as the function runs.
# Run DESeq2
dds <- DESeq(dds)
#> estimating size factors
#> estimating dispersions
#> gene-wise dispersion estimates
#> mean-dispersion relationship
#> final dispersion estimates
#> fitting model and testing
Extracting results after differential expression is calculated
DESeq2 results are extracted with the results()
function.
# Extract default results (No Drug vs. Drug A)
res <- results(dds)
# View results summary
summary(res)
#>
#> out of 21178 with nonzero total read count
#> adjusted p-value < 0.1
#> LFC > 0 (up) : 3165, 15%
#> LFC < 0 (down) : 3337, 16%
#> outliers [1] : 25, 0.12%
#> low counts [2] : 3696, 17%
#> (mean count < 2)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
# Convert dataframe to tibble for use with tidyverse tools
res_tibble <- as_tibble(res, rownames = "gene")
MA-plots
MA plots - plots of mean expression (x-axis) vs. log2-fold change (y-axis) - are useful for visualizing the distribution of gene expression in your comparisons.
# Plot MA without moderating log2 fold-changes
plotMA(res, ylim = c(-6,6))
# Return results with moderated log2 fold-changes.
res_ma <- lfcShrink(dds, contrast=c("drug","none","A"), res=res, type='normal')
#> using 'normal' for LFC shrinkage, the Normal prior from Love et al (2014).
#>
#> Note that type='apeglm' and type='ashr' have shown to have less bias than type='normal'.
#> See ?lfcShrink for more details on shrinkage type, and the DESeq2 vignette.
#> Reference: https://doi.org/10.1093/bioinformatics/bty895
# Plot MA
plotMA(res_ma, ylim = c(-6,6))
Plotting differential expression results
DESeq2 can plot normalized counts using the plotCounts()
function.
# Plot differential expression
plotCounts(dds,
gene = "ESR1",
intgroup=c("drug"))
# Clean up plot with ggplot2
dds_plot <- plotCounts(dds,
gene = "ESR1",
intgroup=c("drug"),
returnData = TRUE)
ggplot(dds_plot, aes(x = drug, y = count, color = drug)) +
scale_y_log10() +
geom_point(cex = 3)