Introduction and Quality Control
RNA Bioscience Initiative | CU Anschutz
2024-10-21
Greetings experimentalist humans 👋
kristen.wells-wrasman@cuanschutz.edu
RBI Informatics Fellows Office hours
qmd
.Below is the design for the 3’ 10x genomics assay
Other assays may have different locations of there cell barcode and UMI.
These may also be different lengths.
Or even on read 2 (ex Parse bioscience).
Be sure to check what kit was used to prep your data and always perform sanity checks throughout the analysis!
droplet-based scRNA-seq: e.g. 10x Genomics or Drop-Seq
Smart-seq based scRNA-seq: (bulk-RNA-seq on single cells in individual wells/tubes)
CITE-Seq: gene expression + cell surface protein abundance
VDJ-Seq: Gene expression + targeted sequencing of T-Cell and B-Cell receptors
Many others: ATAC, spatial transcriptomics, DNA sequencing, etc. (see Integrative Single cell analysis)
cellranger
from 10x Genomics (STAR)alevin
(Salmon)STAR-solo
(From the STAR developers)$ salmon -h
salmon v1.3.0
Usage: salmon -h|--help or
salmon -v|--version or
salmon -c|--cite or
salmon [--no-version-check] <COMMAND> [-h | options]
Commands:
index : create a salmon index
quant : quantify a sample
alevin : single cell analysis # <------
swim : perform super-secret operation
quantmerge : merge multiple quantifications into a single file
$ls alevin/
alevin.log # run info
featureDump.txt # info on each cell barcode
quants_mat.gz # binary file with UMI counts
quants_mat_cols.txt # genes in count matrix
quants_mat_rows.txt # cell barcodes in count matrix
quants_tier_mat.gz # info about mapping for each gene
whitelist.txt # valid barcodes discovered by alevin
Can generate interactive QC reports using alevinQC
In a typical droplet scRNA-seq experiment 100k - 1M cell barcodes are detected, but only 1-10k cells are loaded
Most of these droplets are “empty” and contain very few reads.
What is the source of these reads in the “empty” droplets?
How do we determine if the data from a particular cell barcode is derived from a single cell?
Fit a curve to the observed data and identify point where first derivative is minimized.
Any barcodes less than the “knee”, test sequences for off-by-one errors against the barcodes above the knee.
Take top half of cells above the knee and train a classifier using multiple criteria (% mapping, % mitochondrial and rRNA reads, duplicate rate, …)
Classify ambiguous cells in lower half into likely cells or not.
Doublets are not clearly identifiable using simple QC metrics, so cannot be reliably removed with filtering with # of UMIs or genes detected.
scran::doublet_cluster
: Compare each cluster to an in silico mix of two other clusters. Get per cluster score of likelihood of being a doublet.
scran::doublet_cell
: Compare each cell to a mix of two other randomly selected cells. Get per cell score of likelihood of being a doublet.
Doublets can also arise due to sample prep, e.g. incomplete generations of a single cell suspension. These doublets are difficult to exclude from the data
Before jumping into the analysis, let’s step back and start running through the exercise for today
Start by loading in the packages
scRNA-seq libraries produce reads from 100,000 - 1,000,000 cell barcodes
A matrix of 20,000 genes x 1,000,000 barcodes is 20 billion values (!).
>95% are zeros due to empty droplets and the low efficiency of the library prep (< 10-20% of RNA captured).
vals <- c(
0, 0, 0, 2, 0,
0, 1, 0, 0, 0,
0, 0, 0, 0, 0,
0, 0, 0, 0, 1,
0, 0, 0, 0, 0
)
m <- matrix(vals, ncol = 5)
m
[,1] [,2] [,3] [,4] [,5]
[1,] 0 0 0 0 0
[2,] 0 1 0 0 0
[3,] 0 0 0 0 0
[4,] 2 0 0 0 0
[5,] 0 0 0 1 0
V1 V2 V3 V4 V5
Min. :0.0 Min. :0.0 Min. :0 Min. :0.0 Min. :0
1st Qu.:0.0 1st Qu.:0.0 1st Qu.:0 1st Qu.:0.0 1st Qu.:0
Median :0.0 Median :0.0 Median :0 Median :0.0 Median :0
Mean :0.4 Mean :0.2 Mean :0 Mean :0.2 Mean :0
3rd Qu.:0.0 3rd Qu.:0.0 3rd Qu.:0 3rd Qu.:0.0 3rd Qu.:0
Max. :2.0 Max. :1.0 Max. :0 Max. :1.0 Max. :0
as
function to convert to a sparse matrix5 x 5 sparse Matrix of class "dtCMatrix"
[1,] . . . . .
[2,] . 1 . . .
[3,] . . . . .
[4,] 2 . . . .
[5,] . . . 1 .
Functions at manipulate matricies (rowMeans
, colSums
, apply
, [
) can be used on sparseMatricies as long as the Matrix
package is loaded.
How can we extract the first 2 rows and first 3 columns of the sparse matrix sm
that we generated above?
Functions at manipulate matricies (rowMeans
, colSums
, apply
, [
) can be used on sparseMatricies as long as the Matrix
package is loaded.
How can we extract the first 2 rows and first 3 columns of the sparse matrix sm
that we generated above?
How can we calculate the sum of the columns of sm
?
How can we calculate the sum of the columns of sm
?
c(TRUE, FALSE)
) or name (if vector is named) [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"
matrix[rows_to_subset, columns_to_subset]
The base R data.frame
and Bioconductor DataFrame
can also be subset with the [
and we can reference individual vectors in a data.frame using $
.
# columns can be generated or overwritten using $ with assignment
mtcars$new_column_name <- "Hello!"
mtcars$wt <- mtcars$wt * 1000
# We can subset using logical vectors
# E.g. filter for rows (cars) with mpg > 20
mtcars[mtcars$mpg > 20, ]
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3215 19.44 1 0 3 1
Merc 240D 24.4 4 146.7 62 3.69 3190 20.00 1 0 4 2
Merc 230 22.8 4 140.8 95 3.92 3150 22.90 1 0 4 2
Fiat 128 32.4 4 78.7 66 4.08 2200 19.47 1 1 4 1
Honda Civic 30.4 4 75.7 52 4.93 1615 18.52 1 1 4 2
Toyota Corolla 33.9 4 71.1 65 4.22 1835 19.90 1 1 4 1
Toyota Corona 21.5 4 120.1 97 3.70 2465 20.01 1 0 3 1
Fiat X1-9 27.3 4 79.0 66 4.08 1935 18.90 1 1 4 1
Porsche 914-2 26.0 4 120.3 91 4.43 2140 16.70 0 1 5 2
Lotus Europa 30.4 4 95.1 113 3.77 1513 16.90 1 1 5 2
Volvo 142E 21.4 4 121.0 109 4.11 2780 18.60 1 1 4 2
new_column_name
Mazda RX4 Hello!
Mazda RX4 Wag Hello!
Datsun 710 Hello!
Hornet 4 Drive Hello!
Merc 240D Hello!
Merc 230 Hello!
Fiat 128 Hello!
Honda Civic Hello!
Toyota Corolla Hello!
Toyota Corona Hello!
Fiat X1-9 Hello!
Porsche 914-2 Hello!
Lotus Europa Hello!
Volvo 142E Hello!
tximport
has methods for importing the binary data from alevinquants_mat.gz
file.lapply
, purrr::map
, a for
loop).eds
package was installed which greatly speeds up the loading of the matrix.We will load in data from a 10x Genomics scRNA-seq library generated from human periperhal blood mononuclear cells (PMBCS).
library(tximport)
tx <- tximport(
here("data/block-rna/scrna/pbmc/alevin/quants_mat.gz"),
type = "alevin"
)
names(tx)
[1] "abundance" "counts" "countsFromAbundance"
tx
is a list with 3 elements, abundance
, counts
, and countsFromAbundance
. Let’s look at the counts element
6 x 3 sparse Matrix of class "dgCMatrix"
GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT
ENSG00000243485 . . .
ENSG00000284332 . . .
ENSG00000237613 . . .
ENSG00000268020 . . .
ENSG00000290826 . . .
ENSG00000240361 . . .
Here you can see that tx$counts
is a sparse matrix that is genes (rows) by cells (columns).
How many barcodes are in tx$counts
? How many genes?
What fraction of the matrix is non-zero? We can use the nnzero
function from the Matrix
package check
Key resource for single cell analysis in Bioconductor: Orchestrating Single Cell Analysis
SingleCellExperiment
is the core datastructure for storing single cell data.
scran
provides algorithms for low-level processing of single cell data.
scater
provides plotting, data transformation, and quality control functionality
A SingleCellExperiment
object can be created from our sparse matrix using the SingleCellExperiment()
function.
class: SingleCellExperiment
dim: 62266 6075
metadata(0):
assays(1): counts
rownames(62266): ENSG00000290825 ENSG00000223972 ... ENSG00000210195
ENSG00000210196
rowData names(0):
colnames(6075): GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ... ACGTAGGGTGACAGCA
TCTCAGCTCGCCGAAC
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
The SingleCellExperiment
object stores the gene x cell count matrix within assays()
.
List of length 1
names(1): counts
4 x 4 sparse Matrix of class "dgCMatrix"
GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT
ENSG00000290825 . . .
ENSG00000223972 . . .
ENSG00000227232 . . .
ENSG00000278267 . . .
TATCTGTAGGTGATAT
ENSG00000290825 .
ENSG00000223972 .
ENSG00000227232 .
ENSG00000278267 .
4 x 4 sparse Matrix of class "dgCMatrix"
GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT
ENSG00000290825 . . .
ENSG00000223972 . . .
ENSG00000227232 . . .
ENSG00000278267 . . .
TATCTGTAGGTGATAT
ENSG00000290825 .
ENSG00000223972 .
ENSG00000227232 .
ENSG00000278267 .
colData
colData()
DataFrame
) which has similar semantics and functionality to a base R data.frame.DataFrame with 6075 rows and 0 columns
# add a sample annotation
colData(sce)$cell_source <- "PBMC"
# equivalent approach using $
sce$cell_source <- "PBMC"
colData(sce)
DataFrame with 6075 rows and 1 column
cell_source
<character>
GCTGCAGTCCGATCTC PBMC
ACTATGGAGGTCCCTG PBMC
ATTTCTGTCTCTATGT PBMC
TATCTGTAGGTGATAT PBMC
AGCCAGCCAAAGCACG PBMC
... ...
CATCCCAAGTACTCGT PBMC
CTCCTCCCATGAAGCG PBMC
AGTTCCCCATGTCAGT PBMC
ACGTAGGGTGACAGCA PBMC
TCTCAGCTCGCCGAAC PBMC
rowData()
.rowData(sce)$total_gene_counts <- rowSums(counts(sce))
rowData(sce)$n_cells_expr <- rowSums(counts(sce) > 0)
rowData(sce)
DataFrame with 62266 rows and 2 columns
total_gene_counts n_cells_expr
<numeric> <integer>
ENSG00000290825 65.594607 43
ENSG00000223972 0.000000 0
ENSG00000227232 5.506349 24
ENSG00000278267 0.000000 0
ENSG00000243485 0.333333 1
... ... ...
ENSG00000198695 2054 1509
ENSG00000210194 0 0
ENSG00000198727 274421 5704
ENSG00000210195 0 0
ENSG00000210196 0 0
dplyr
verbs do not work with SingleCellExperiment
dplyr
verbs do not work with SingleCellExperiment
# subset to data from first 4 genes and cells
sce[1:4, 1:4]
# subset to cells from PBMC cells
sce[, sce$cell_source == "PBMC"]
genes_to_keep <- c("ENSG00000223972", "ENSG00000210195", "ENSG00000210196")
sce[genes_to_keep, ]
cells_to_keep <- c("ACTATGGAGGTCCCTG", "GCTGCAGTCCGATCTC", "TCTCAGCTCGCCGAAC")
sce[, cells_to_keep]
ncol()
: # of cells
nrow()
: # of gene
dims()
: # of genes and cells
rownames()
: rownames in matrices (e.g. genes)
colnames()
: colnames in matrices (e.g. cells)
cbind()
: combine multiple SingleCellExperiments by column
rbind()
: combine multiple SingleCellExperiments by row
ah <- AnnotationHub()
# download ensembl database
ens_db <- ah[["AH113665"]]
gene_names <- mapIds(ens_db,
keys = rownames(sce),
keytype = "GENEID",
column = "SYMBOL"
)
rowData(sce)$gene <- gene_names
rowData(sce)$gene_id <- rownames(sce)
rowData(sce)
DataFrame with 62266 rows and 4 columns
total_gene_counts n_cells_expr gene gene_id
<numeric> <integer> <character> <character>
ENSG00000290825 65.594607 43 DDX11L2 ENSG00000290825
ENSG00000223972 0.000000 0 DDX11L1 ENSG00000223972
ENSG00000227232 5.506349 24 WASH7P ENSG00000227232
ENSG00000278267 0.000000 0 MIR6859-1 ENSG00000278267
ENSG00000243485 0.333333 1 MIR1302-2HG ENSG00000243485
... ... ... ... ...
ENSG00000198695 2054 1509 MT-ND6 ENSG00000198695
ENSG00000210194 0 0 MT-TE ENSG00000210194
ENSG00000198727 274421 5704 MT-CYB ENSG00000198727
ENSG00000210195 0 0 MT-TT ENSG00000210195
ENSG00000210196 0 0 MT-TP ENSG00000210196
, NA
, or duplicateduniquifyFeatureNames()
is a convenience function that will rename gene symbols that are NA
or duplicated values to the ensembl ID or a combination of gene symbol and ensembl IDNext we perform some filtering and quality control to remove low expression genes and poor quality cells.
Our SingleCellExperiment has 62266 genes in the matrix. Most of these are not expressed. We want to exclude these genes as they won’t provide any useful data for the analysis.
# exclude genes expressed in fewer than 10 cells (~ 1% of cells)
rowData(sce)$n_cells <- rowSums(counts(sce) > 0)
sce <- sce[rowData(sce)$n_cells >= 10, ]
sce
class: SingleCellExperiment
dim: 20858 6075
metadata(0):
assays(1): counts
rownames(20858): DDX11L2_ENSG00000290825 WASH7P ... MT-ND6 MT-CYB
rowData names(5): total_gene_counts n_cells_expr gene gene_id n_cells
colnames(6075): GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ... ACGTAGGGTGACAGCA
TCTCAGCTCGCCGAAC
colData names(2): cell_source total_counts
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
To exclude low-quality cells we will use the following metrics:
A low number of counts, a low number of detected genes, and a high percentage of mitochondrial counts suggests that the cell had a broken membrane and the cytoplasmic mRNA leaked out.
To calculate these metrics we can use addPerCellQCMetrics
from scater. Mitochondrial genes are named with a common “MT-” prefix (e.g. MT-CO2, MT-ATP6, MR-RNR2), which we can use to identify them.
# identify subset of genes that are from mitochondrial genome
is_mito <- startsWith(rowData(sce)$gene, "MT-")
sce <- addPerCellQCMetrics(sce, subsets = list(Mito = is_mito))
colData(sce)
DataFrame with 6075 rows and 8 columns
cell_source total_counts sum detected subsets_Mito_sum
<character> <numeric> <numeric> <integer> <numeric>
GCTGCAGTCCGATCTC PBMC 31924 31908.0 5718 2668.49
ACTATGGAGGTCCCTG PBMC 35845 35801.5 6073 3963.23
ATTTCTGTCTCTATGT PBMC 31788 31758.3 6377 2496.33
TATCTGTAGGTGATAT PBMC 32025 31998.0 6260 2961.15
AGCCAGCCAAAGCACG PBMC 29882 29856.0 5746 2660.44
... ... ... ... ... ...
CATCCCAAGTACTCGT PBMC 151 151.000 132 12
CTCCTCCCATGAAGCG PBMC 133 132.500 131 6
AGTTCCCCATGTCAGT PBMC 173 172.667 166 8
ACGTAGGGTGACAGCA PBMC 255 253.000 217 13
TCTCAGCTCGCCGAAC PBMC 144 144.000 149 4
subsets_Mito_detected subsets_Mito_percent total
<integer> <numeric> <numeric>
GCTGCAGTCCGATCTC 13 8.36307 31908.0
ACTATGGAGGTCCCTG 15 11.07002 35801.5
ATTTCTGTCTCTATGT 15 7.86038 31758.3
TATCTGTAGGTGATAT 15 9.25417 31998.0
AGCCAGCCAAAGCACG 15 8.91092 29856.0
... ... ... ...
CATCCCAAGTACTCGT 5 7.94702 151.000
CTCCTCCCATGAAGCG 5 4.52830 132.500
AGTTCCCCATGTCAGT 6 4.63320 172.667
ACGTAGGGTGACAGCA 6 5.13834 253.000
TCTCAGCTCGCCGAAC 2 2.77778 144.000
We can use the plotColData()
function from scater to plot various metrics (as a ggplot2 object).
We can use the plotColData()
function from scater to plot various metrics (as a ggplot2 object).
We can use the plotColData()
function from scater to plot various metrics (as a ggplot2 object).
We can use the plotColData()
function from scater to plot various metrics (as a ggplot2 object).
We can also extract colData
as a dataframe to make custom ggplot2
plots
How many cells pass these criteria?
How many cells pass these criteria?
Lastly we can subset the SingleCellExperiment
to exclude the low-quality cells.
class: SingleCellExperiment
dim: 20858 4565
metadata(0):
assays(1): counts
rownames(20858): DDX11L2_ENSG00000290825 WASH7P ... MT-ND6 MT-CYB
rowData names(5): total_gene_counts n_cells_expr gene gene_id n_cells
colnames(4565): GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ... CTGGATCCACCGTACG
TACAACGGTCTCGCGC
colData names(9): cell_source total_counts ... total pass_qc
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):
runPCA
from scater to perform PCA.scran
Description of algorithim: Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016 Apr 27;17:75. doi: 10.1186/s13059-016-0947-7. PMID: 27122128; PMCID: PMC4848819.
Plot the PCA with normalization
PC1 no longer correlates with UMI counts.
eBook: Orchestrating single cell analyses with Bioconductor eBook Publication
Review: Current best practices for analysis
Blog: Single Cell Thoughts
Blog: What do you mean “heterogeneity”
Course website: https://rnabioco.github.io/molb-7950