Clustering, Dimensionality reduction, and cell-type annotation
RNA Bioscience Initiative | CU Anschutz
2024-10-21
Greetings experimentalist humans 👋
kristen.wells-wrasman@cuanschutz.edu
RBI Informatics Fellows Office hours
qmd
.Normalize and log transform UMI counts to correct for sequencing depth.
Select genes with high variance to use for clustering and UMAP generation
Use PCA to reduce the size of the dataset to ~20-50 dimensions
Use UMAP or tSNE to further reduce the PCA matrix into 2D for visualization
Use clustering on PCA matrix to identify clusters of related cells.
Find marker genes that are specifically expressed in each cluster.
Compare the gene expression profile of each cluster to public data to help annotate the cell type.
# normalize data
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters=clusters)
sce <- logNormCounts(sce)
# get variable genes
dec <- modelGeneVarByPoisson(sce)
top <- getTopHVGs(dec, prop=0.1)
# get PCA and UMAP
sce <- runPCA(sce, subset_row = top)
sce <- runUMAP(sce, dimred = "PCA")
# cluster cells
sce$clusters <- clusterCells(sce, use.dimred = "PCA")
# get marker genes
mrks <- scoreMarkers(sce, sce$clusters)
...
# load data
library(tximport)
tx <- tximport(
here("data/block-rna/scrna/pbmc/alevin/quants_mat.gz"),
type = "alevin"
)
# setup gene ids
sce <- SingleCellExperiment(list(counts = tx$counts))
ah <- AnnotationHub()
ens_db <- ah[["AH113665"]]
gene_names <- mapIds(ens_db,
keys = rownames(sce),
keytype = "GENEID",
column = "SYMBOL"
)
rowData(sce)$gene <- gene_names
rowData(sce)$gene_id <- rownames(sce)
rownames(sce) <- uniquifyFeatureNames(
rowData(sce)$gene_id,
rowData(sce)$gene
)
# drop non/low expressed genes
rowData(sce)$n_cells <- rowSums(counts(sce) > 0)
sce <- sce[rowData(sce)$n_cells >= 10, ]
# basic QC
is_mito <- startsWith(rowData(sce)$gene, "MT-")
sce <- addPerCellQCMetrics(sce, subsets = list(Mito = is_mito))
sce$pass_qc <- sce$subsets_Mito_percent < 20 & sce$sum > 1000 & sce$detected > 500
sce <- sce[, sce$pass_qc]
quickCluster()
: Crude clustering to group related cells into groups with similar expression profiles
computeSumFactors()
: Pool counts across clusters to establish an average cell profile for each cluster. Then use deconvolution to estimate a cell-specific scaling factor for normalization
logNormCounts()
: Apply scaling factors (size factors) to counts and log transform with a pseudocount.
set.seed(20231023)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters = clusters)
sce <- logNormCounts(sce)
logcounts(sce)[50:60, 1:4]
11 x 4 sparse Matrix of class "dgCMatrix"
GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT TATCTGTAGGTGATAT
CPTP . 0.2645668 . .
TAS1R3 . . . .
DVL1 . 0.2645668 . .
MXRA8 . . . .
AURKAIP1 0.3500082 1.9327619 0.2923841 0.7922116
CCNL2 0.6314635 . 0.7432885 0.5731981
MRPL20-AS1 0.3500082 0.2645668 0.2923841 .
MRPL20 0.8668712 1.1425125 0.9249737 0.7922116
RN7SL657P . . . .
MRPL20-DT . . . 0.3148810
ATAD3C . . . .
Genes that have high variance across cells are genes that tend to be differentially expressed between cell types
Low variance genes are usually low-expressed or “house-keeping” genes whose expression will not help us distinguish between cell populations
Including these genes could potentially introduce more technical variation rather then helpful biological variation.
Keeping uninteresting genes will increase the computational burden and likely either not improve or be deleterious to the clustering.
modelGeneVarByPoisson()
: Fit curve, using Poisson distribution, to variance against the mean expression distribution. Estimate technical (Poisson estimate) and biological (the residuals from the Poisson) variance for each gene.
getTopHVGs()
: Filter output from modelGeneVarByPoisson
to select top variable genes. Use prop
to identify the proportion of genes to call highly variable or n
to identify a number. Often starting between 1000-2000 is best.
modelGeneVarByPoisson()
: Fit curve, using Poisson distribution, to variance against the mean expression distribution. Estimate technical (Poisson estimate) and biological (the residuals from the Poisson) variance for each gene.
getTopHVGs()
: Filter output from modelGeneVarByPoisson
to select top variable genes. Use prop
to identify the proportion of genes to call highly variable or n
to identify a number. Often starting between 1000-2000 is best.
We can plot mean expression against variance to see the trend and visualize the top variable genes. These are often marker genes of specific cell populations.
Dimensionality reduction is the concept of transforming high dimensional data and representing it with a smaller set of dimensions.
We can reduce the # of dimensions without loosing much information because many features (genes) are highly correlated which can be approximated with fewer dimensions.
This is analogous to reducing the gene expression data into a set of metagenes that represents the expression of many correlated genes.
Using fewer dimensions makes computation much quicker and as we will see will reorder the data in a manner that still captures the heterogeneity in the data.
We will use PCA to reduce the dimensionality of the data from 20,000 genes to ~10-50 principle components.
PCA takes a matrix of features (genes) and samples (cells) and transforms the matrix into a new set of features known as a principal components. A principal component is a linear combination of the original genes that is oriented to capture variance in the dataset.
PC1
is a vector through the dataset oriented in a direction that spans the most variation in the data. The second component is another linear combination of the original variables but is uncorrelated to the first component (points in an orthogonal direction).
In a geometric sense, PCA produces a new coordinate system whereby instead of the axes being each gene, the axes are each a PC and the first axis points through the largest spread in the data.
In this two dimensional space, PCA minimizes the distances from a line and every point in the x and y direction (think of a regression line). For a great walk through of how PCA works, see section 7.3 in Modern Statistics for Modern Biology
What if we had more dimensions that are uninteresting variance/noise? How does this impact the PCA?
gene_1 gene_2 gene_noise_1 gene_noise_2 gene_noise_3
1 -2.995217 -2.123704 0.089215287 -0.40819808 -0.48037882
2 -2.821900 -2.933205 -0.339377410 -0.03582506 0.41886793
3 -2.278693 -2.375272 0.131539067 -0.42624492 0.28643147
4 -2.167279 -2.535480 0.098178061 -0.20509451 -0.06692551
5 -1.528019 -1.310610 0.007170167 0.04159706 0.03533156
runPCA()
: computes an approximate truncated PCA, returning 50 PCs by default.
reducedDim(sce, "PCA")
: function to access or assign reduced dimensionality results
plotPCA()
: plot 2 or more PC components
plotReducedDim()
: plot 2 or more dimensions or arbitrary reducedDim()
matrix
runPCA()
: computes an approximate truncated PCA, returning 50 PCs by default.
reducedDim(sce, "PCA")
: function to access or assign reduced dimensionality results
plotPCA()
: plot 2 or more PC components
plotReducedDim()
: plot 2 or more dimensions or arbitrary reducedDim()
matrix
In general the # of PCs to use for downstream steps depends on the complexity and heterogeneity in the data, as well as the biological question. For this analysis we will retain the top 30, but you should explore how the downstream steps are affected by including fewer or more PCs.
In practice picking fewer PCs will identify fewer subpopulations, and picking more PCs will find more subpopulations, at the expense of potentially increased noise and longer runtime.
PC1 PC2 PC3 PC4
GCTGCAGTCCGATCTC -17.7865879 -5.494428 -3.609731 -2.2119402
ACTATGGAGGTCCCTG -0.0933944 7.195291 2.840164 0.0138008
ATTTCTGTCTCTATGT -17.8824692 -1.533912 -3.546737 0.5637093
TATCTGTAGGTGATAT -16.7533963 -2.843034 -5.289824 -1.3747961
Non-linear dimensionallty reductions are commonly used to project the PCA matrix into 2 dimensions for visualization
This entails trying to reduce the ~10-50 PCA dimensions into a 2 dimensional space.
To do this the algorithm performs non-linear transformations that distort the true distances between cells
Attempts to balance global and local differences to produce visualization that capture variation in data.
runUMAP()
and runTSNE()
We can also color by any gene expression
We use clustering to assign cells to discrete cell populations. This is a simplification of the true complexity of the underlying data.
Clustering is a data analysis tool rather than a result. We can increase, decrease, merge, or subset clusters at our whim by changing clustering parameters, the # of PCs used, the # of variable genes, and the particular cell subsets analyzed.
There is no correct or valid number of clusters. It depends on what biological question you are trying to explore and at what granularity you want to ask the question.
[A]sking for an unqualified “best” clustering is akin to asking for the best magnification on a microscope without any context - OSCA book
Approach:
clusterCells()
Note that the clustering is performed on the PCA matrix not the 2D UMAP plot
The parameters of the clustering can be changed by changing the NNGraphParam
. Here are the defaults:
library(bluster)
set.seed(10101010)
clusters <- clusterCells(sce,
use.dimred = "PCA",
BLUSPARAM = NNGraphParam(
k = 10,
type = "rank",
cluster.fun = "walktrap"
)
)
table(clusters)
clusters
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
766 257 552 122 369 211 141 1409 52 74 55 63 44 47 103 97
17 18 19
163 24 16
The parameters of the clustering can be changed by changing the NNGraphParam
. Here are the defaults:
Generally increasing the # of neighbors will decrease the number of clusters and vice versa.
What clusters are the genes S100A8, CD79A and CD3G most highly expressed in? We can use the plotExpression()
function to visualize the expression data in each cluster.
Clustering algorithms produce clusters, even if there isn’t anything meaningfully different between cells. Determining the optimal number of clusters can be tricky and also dependent on the biological question.
Some guidelines:
Cluster the data into a small number of clusters to identify cell types, then recluster to generate additional clusters to define sub-populations of cell types.
To determine if the data is overclustered examine differentially expressed genes between clusters. If the clusters have few or no differentially expressed genes then the data is likely overclustered. Similar clusters can be merged post-hoc if necessary as sometimes it is difficult to use 1 clustering approach for many diverse cell populations. Using a reference-based approach to name cell types will also often merge similar clusters.
A hybrid approach is to first annotate major cell types using a small # of clusters (e.g. B-cell, T-cell, Myeloid, etc.), then subset the SingleCellExperiment object for each cluster and perform additional clustering to obtain ‘subclusters’ (e.g. b-cell-subset 1, b-cell-subset 2, t-cell-subset-1, etc.)
Clustering algorithms produce clusters, even if there isn’t anything meaningfully different between cells. Determining the optimal number of clusters can be tricky and also dependent on the biological question.
Some guidelines:
Cluster the data into a small number of clusters to identify cell types, then recluster to generate additional clusters to define sub-populations of cell types.
To determine if the data is overclustered examine differentially expressed genes between clusters. If the clusters have few or no differentially expressed genes then the data is likely overclustered. Similar clusters can be merged post-hoc if necessary as sometimes it is difficult to use 1 clustering approach for many diverse cell populations. Using a reference-based approach to name cell types will also often merge similar clusters.
A hybrid approach is to first annotate major cell types using a small # of clusters (e.g. B-cell, T-cell, Myeloid, etc.), then subset the SingleCellExperiment object for each cluster and perform additional clustering to obtain ‘subclusters’ (e.g. b-cell-subset 1, b-cell-subset 2, t-cell-subset-1, etc.)
To identify marker genes we will need to decide of cutoffs based on effect sizes. There are many reported by scoreMarkers().
[1] "self.average" "other.average" "self.detected"
[4] "other.detected" "mean.logFC.cohen" "min.logFC.cohen"
[7] "median.logFC.cohen" "max.logFC.cohen" "rank.logFC.cohen"
[10] "mean.AUC" "min.AUC" "median.AUC"
[13] "max.AUC" "rank.AUC" "mean.logFC.detected"
[16] "min.logFC.detected" "median.logFC.detected" "max.logFC.detected"
[19] "rank.logFC.detected"
What is a marker gene? * Differentially expressed between cluster 1 and all others? * This could be too strict, what if the same gene marks clusters 1-8 but not 9? We would miss that as a marker * Differentially expressed between some or any pairwise clusters
The output of scoreMarkers() allows you to filter the data in many different ways depending on what type of marker you want to identify. Importantly it doesn’t hide the complexity of this data. I highly recommend reading the chapter on marker gene detection from the OSCA book to understand how best to identify markers depending on the dataset you are analyzing.
AUC
to rank markers
mean
, min
, median
, max
, or rank
for potential rankingParaphrasing from the OSCA book: * The most obvious summary statistic is the mean. For cluster X, a large mean effect size (>0.5 for the AUCs) indicates that the gene is upregulated in X compared to the average of the other groups. * Another summary statistic is the median, where a large value indicates that the gene is upregulated in X compared to most (>50%) other clusters. * The minimum value (min.) is the most stringent summary for identifying upregulated genes, as a large value indicates that the gene is upregulated in X compared to all other clusters. The maximum value (max.) is the least stringent summary for identifying upregulated genes, as a large value can be obtained if there is strong upregulation in X compared to any other cluster. The minimum rank, a.k.a., “min-rank” (rank.*) is the smallest rank of each gene across all pairwise comparisons. Specifically, genes are ranked within each pairwise comparison based on decreasing effect size, and then the smallest rank across all comparisons is reported for each gene. If a gene has a small min-rank, we can conclude that it is one of the top upregulated genes in at least one comparison of X to another cluster.
Plotting top genes ranked by mean.AUC
for cluster 1.
Next we will extract the top N markers from each cluster, ranked on mean.AUC
, then plot the average expression of these markers in each cluster as a heatmap.
n_top_markers <- 5
top_markers <- purrr::map_dfr(mrks, ~ {
.x |>
as.data.frame() |>
tibble::rownames_to_column("gene") |>
dplyr::filter(mean.AUC > 0.5) |>
dplyr::arrange(desc(mean.AUC)) |>
dplyr::slice(1:n_top_markers)
}, .id = "cluster")
top_markers[1:10, 1:2]
cluster gene
1 1 S100A9
2 1 S100A8
3 1 LYZ
4 1 MNDA
5 1 FCN1
6 2 BANK1
7 2 CD79A
8 2 MS4A1
9 2 TNFRSF13C
10 2 RALGPS2
Next we will extract the top N markers from each cluster, ranked on mean.AUC
, then plot the average expression of these markers in each cluster as a heatmap.
clustifyr
, SingleR
, SignacX
, and Azimuth
We we use clustifyr
which was developed by a previous RBI fellow Rui Fu. There are also many other methods (e.g. SingleR
)
clustifyr
compares the average gene expression in each cluster to a reference matrix that contains average gene signatures of reference cell types.In order to compare our dataset we need to use a publicly available reference dataset. Thankfully many datasets have been organized into a separate package: clustifyrdatahub. This is an extension of the ExperimentHub
package on bioconductor that allows you to easily download and cache external datasets.
library(clustifyr)
library(clustifyrdatahub)
# get reference dataset derived from a microarray experiment of sorted immune cell types
ref_hema_microarray()[1:5, 1:5]
Basophils CD4+ Central Memory CD4+ Effector Memory CD8+ Central Memory
DDR1 6.084244 5.967502 5.933039 6.005278
RFC2 6.280044 6.028615 6.047005 5.992979
HSPA6 6.535444 5.811475 5.746326 5.928349
PAX8 6.669153 5.896401 6.118577 6.270870
GUCA1A 5.239230 5.232116 5.206960 5.227415
CD8+ Effector Memory
DDR1 5.895926
RFC2 5.942426
HSPA6 5.942670
PAX8 6.323922
GUCA1A 5.090882
res <- clustify(
sce, # SingleCellExperiment object
ref_mat = ref_hema_microarray(), # cell type reference data
cluster_col = "clusters", # column in metadata with clusters
# don't add to SingleCellExperiment object, just return results
obj_out = FALSE,
# use variable genes for comparison
query_genes = top,
)
res[1:12, 1:4]
Basophils CD4+ Central Memory CD4+ Effector Memory CD8+ Central Memory
1 0.5370093 0.4111325 0.4176017 0.3936302
10 0.5945973 0.4591039 0.4655590 0.4433302
11 0.4386922 0.7151960 0.7115503 0.6798845
12 0.6333579 0.5402749 0.5289964 0.5160502
13 0.3935435 0.6535071 0.6516310 0.6321161
14 0.4656631 0.6099487 0.6324938 0.6493463
15 0.6258199 0.5493260 0.5378138 0.5284327
16 0.4883762 0.7668150 0.7549072 0.7346618
17 0.6277840 0.5580115 0.5459105 0.5343926
18 0.5650205 0.5267830 0.5237343 0.5021841
19 0.3379409 0.3154402 0.3177152 0.3003928
2 0.6210041 0.5744730 0.5678439 0.5518513
We can use cor_to_call
to pull out the highest correlation for each cluster
cor_to_call(res) # assign cluster to highest correlation cell type (above a threshold). Cell types lower than a threshold will be assigned as unassigned.
# A tibble: 19 × 3
# Groups: cluster [19]
cluster type r
<chr> <chr> <dbl>
1 11 CD4+ Central Memory 0.715
2 13 CD4+ Central Memory 0.654
3 5 CD4+ Effector Memory 0.769
4 3 CD8+ Effector Memory 0.747
5 2 Mature B-cell class switched 0.772
6 14 Mature NK cell_CD56- CD16- CD3- 0.715
7 6 Mature NK cell_CD56+ CD16+ CD3- 0.729
8 7 Mature NK cell_CD56+ CD16+ CD3- 0.735
9 19 Megakaryocyte 0.462
10 1 Monocyte 0.756
11 4 Monocyte 0.713
12 10 Myeloid Dendritic Cell 0.743
13 9 Myeloid Dendritic Cell 0.641
14 12 Naïve B-cells 0.778
15 15 Naïve B-cells 0.794
16 17 Naïve B-cells 0.797
17 16 Naive CD4+ T-cell 0.781
18 8 Naive CD4+ T-cell 0.765
19 18 Plasmacytoid Dendritic Cell 0.673
Or we can insert the classification results into the SingleCellExperiment object directly which will be called type
.
You can share your SingleCellExperiment objects with collaborators by saving the object as an .rds
file.
If you plan on publishing the data, then the best practices are to upload the UMI count matrix and your colData() data.frame containing the cell type annotations.
To save the UMI count matrix use write10xcounts() from the DropletUtils Bioconductor package.
When saving the colData()
it is also a nice gesture to include the UMAP coordinates that you used for the main visualizations in your manuscript. The clustering and UMAP coordinates are very hard to reproduce because of the non-deterministic elements of the algorithms.
Course website: https://rnabioco.github.io/molb-7950