finding markers and differential expression analysis
After clustering, differential expression testing (DE analysis, similar to bulk RNA seq) finds gene signatures of each cluster/cell population, to give more insights into biology.
many proposed methods
(for Seurat, we recommend default “wilcox” or “t”, good balances between speed and accuracy)
Soneson and Robinson, 2018
important considerations
keep things simple first and look at the results (try not aligning, not regressing)
determine what variables to regress out (batch, nCount_RNA, percent_mito, cell cycle, if needed) during scaling (note this only affects dimension reduction and clustering)
library(Seurat)
library(tidyverse)
# load data from saved RDS
sobj <- readRDS("_vignettes/data/filtered_sobj.rds")
# log normalize
sobj_l <- NormalizeData(sobj) # <- skip this step if using scran normalization
sobj_l <- ScaleData(sobj_l,
vars.to.regress = c("nCount_RNA",
"percent_mito"))
# alternatively, sctransform is a one step normalization and scaling process
sobj_sc <- suppressWarnings(SCTransform(object = sobj, # sctransform has some unhelpful warnings
vars.to.regress = c("percent_mito"), # already corrects for transcript number
verbose = FALSE))
use normalized data for DE (slot is dependent on normalization method, also don’t use integrated assay)
# before normalization, "data" slot is the same as counts
sobj@assays$RNA@data[101:105,101:105] # raw counts
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> ATCTTGACCTCCCA-1 ATCTTTCTTGTCCC-1 ATGAAGGACCTGTC-1
#> PGD 2 . .
#> APITD1 . . .
#> DFFA . . .
#> PEX14 . . .
#> CASZ1 . . .
#> ATGACGTGATCGGT-1 ATGAGAGAAGTAGA-1
#> PGD . .
#> APITD1 . .
#> DFFA . .
#> PEX14 . .
#> CASZ1 . .
# after log normalization, results are stored in "data" slot
sobj_l@assays$RNA@data[101:105,101:105] # normalized
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> ATCTTGACCTCCCA-1 ATCTTTCTTGTCCC-1 ATGAAGGACCTGTC-1
#> PGD 2.727586 . .
#> APITD1 . . .
#> DFFA . . .
#> PEX14 . . .
#> CASZ1 . . .
#> ATGACGTGATCGGT-1 ATGAGAGAAGTAGA-1
#> PGD . .
#> APITD1 . .
#> DFFA . .
#> PEX14 . .
#> CASZ1 . .
# after sctransform, results are stored in new assay "SCT"
sobj_sc@assays$SCT@data[101:105,101:105] # normalized
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> ATCTTGACCTCCCA-1 ATCTTTCTTGTCCC-1 ATGAAGGACCTGTC-1
#> PGD 1.386294 . .
#> APITD1 . . .
#> DFFA . . .
#> PEX14 . . .
#> CASZ1 . . .
#> ATGACGTGATCGGT-1 ATGAGAGAAGTAGA-1
#> PGD . .
#> APITD1 . .
#> DFFA . .
#> PEX14 . .
#> CASZ1 . .
sobj_sc@assays$RNA@data[101:105,101:105] # still same as counts, make sure not to use this!
#> 5 x 5 sparse Matrix of class "dgCMatrix"
#> ATCTTGACCTCCCA-1 ATCTTTCTTGTCCC-1 ATGAAGGACCTGTC-1
#> PGD 2 . .
#> APITD1 . . .
#> DFFA . . .
#> PEX14 . . .
#> CASZ1 . . .
#> ATGACGTGATCGGT-1 ATGAGAGAAGTAGA-1
#> PGD . .
#> APITD1 . .
#> DFFA . .
#> PEX14 . .
#> CASZ1 . .
Note that marker genes found is very dependent on clustering and the compared populations
P‐values are inflated, due to the cyclic nature of identifying the same variable genes as markers, which were used for dimension reduction and clustering. However, the ranking of genes based on P‐values is unaffected.
Sidebar: How to regress out cell cycle heterogeneity (and do you need to?)
# many options to assess cell cycle effects on gene expression
# 1. ridge plot on key genes (from Seurat2 tutorial)
RidgePlot(sobj_l,
features = c("MKI67", # smoothing eaves plots empty if no significant amount of cells express it
"MCM6",
"TOP2A",
"PCNA",
"ZFP36"), # how actual robust gene expression should look
ncol = 2)
# see https://satijalab.org/seurat/v3.0/cell_cycle_vignette.html for a case where some cells are actively cycling
# 2. assess phase and PCA
# Seurat stores a list of cell cycle specific genes for humans
s.genes <- Seurat::cc.genes$s.genes # obviously unsuitable for nonhuman
g2m.genes <- Seurat::cc.genes$g2m.genes
s.genes
#> [1] "MCM5" "PCNA" "TYMS" "FEN1" "MCM2" "MCM4"
#> [7] "RRM1" "UNG" "GINS2" "MCM6" "CDCA7" "DTL"
#> [13] "PRIM1" "UHRF1" "MLF1IP" "HELLS" "RFC2" "RPA2"
#> [19] "NASP" "RAD51AP1" "GMNN" "WDR76" "SLBP" "CCNE2"
#> [25] "UBR7" "POLD3" "MSH2" "ATAD2" "RAD51" "RRM2"
#> [31] "CDC45" "CDC6" "EXO1" "TIPIN" "DSCC1" "BLM"
#> [37] "CASP8AP2" "USP1" "CLSPN" "POLA1" "CHAF1B" "BRIP1"
#> [43] "E2F8"
g2m.genes
#> [1] "HMGB2" "CDK1" "NUSAP1" "UBE2C" "BIRC5" "TPX2"
#> [7] "TOP2A" "NDC80" "CKS2" "NUF2" "CKS1B" "MKI67"
#> [13] "TMPO" "CENPF" "TACC3" "FAM64A" "SMC4" "CCNB2"
#> [19] "CKAP2L" "CKAP2" "AURKB" "BUB1" "KIF11" "ANP32E"
#> [25] "TUBB4B" "GTSE1" "KIF20B" "HJURP" "CDCA3" "HN1"
#> [31] "CDC20" "TTK" "CDC25C" "KIF2C" "RANGAP1" "NCAPD2"
#> [37] "DLGAP5" "CDCA2" "CDCA8" "ECT2" "KIF23" "HMMR"
#> [43] "AURKA" "PSRC1" "ANLN" "LBR" "CKAP5" "CENPE"
#> [49] "CTCF" "NEK2" "G2E3" "GAS2L3" "CBX5" "CENPA"
# score and phase call is added to metadata
sobj_l <- CellCycleScoring(sobj_l,
s.features = s.genes,
g2m.features = g2m.genes,
set.ident = FALSE)
sobj_l@meta.data %>% head()
#> orig.ident nCount_RNA nFeature_RNA percent_mito
#> AAACATACAACCAC-1 control 2419 779 3.017776
#> AAACATTGAGCTAC-1 control 4901 1350 3.795144
#> AAACGCTGACCAGT-1 control 2174 781 3.817847
#> AAACGCTGGTTCTT-1 control 2259 789 3.098716
#> AAAGTTTGATCACG-1 control 1265 441 3.478261
#> AAATCATGACCACA-1 control 4125 1365 4.581818
#> S.Score G2M.Score Phase
#> AAACATACAACCAC-1 0.10140023 -0.05909799 S
#> AAACATTGAGCTAC-1 -0.03669937 -0.05460303 G1
#> AAACGCTGACCAGT-1 -0.04334459 -0.04401501 G1
#> AAACGCTGGTTCTT-1 -0.02891103 0.01944167 G2M
#> AAAGTTTGATCACG-1 0.09548679 -0.07351012 S
#> AAATCATGACCACA-1 -0.06976057 -0.01901402 G1
# check PCA to see if cell cycle has strong effects
sobj_l <- FindVariableFeatures(sobj_l)
sobj_l <- RunPCA(sobj_l, verbose = FALSE, npcs = 10)
DimPlot(sobj_l, group.by = "Phase") # again, in this case, cell cycle does not have strong effects
# 3. look for cell cycle-specific genes as main drivers of PCA
topPCAgenes <- apply(
sobj_l@reductions$pca@feature.loadings, # contribution of each gene to each PC is stored here
MARGIN = 2,
FUN = function(x) names(sort(abs(x), decreasing = TRUE)[1:10]) # finds the most important genes in each PC
)
topPCAgenes
#> PC_1 PC_2 PC_3 PC_4 PC_5
#> [1,] "CST3" "NKG7" "HLA-DQA1" "PF4" "CKB"
#> [2,] "S100A9" "CST7" "CD79A" "SDPR" "FCGR3A"
#> [3,] "TYROBP" "GZMA" "CD79B" "PPBP" "MS4A7"
#> [4,] "FTL" "PRF1" "HLA-DPB1" "HIST1H2AC" "RP11-290F20.3"
#> [5,] "LST1" "GZMB" "CD74" "GNG11" "RHOC"
#> [6,] "FCN1" "FGFBP2" "HLA-DPA1" "SPARC" "SIGLEC10"
#> [7,] "AIF1" "GNLY" "MS4A1" "GP9" "LILRA3"
#> [8,] "LYZ" "CTSW" "HLA-DQB1" "NRGN" "LGALS2"
#> [9,] "FTH1" "CD79A" "HLA-DRB1" "RGS18" "HMOX1"
#> [10,] "S100A8" "CCL4" "PPBP" "PTCRA" "MS4A6A"
#> PC_6 PC_7 PC_8 PC_9 PC_10
#> [1,] "GZMK" "CCL5" "FCER1A" "IFIT1" "IFIT1"
#> [2,] "IL32" "GZMK" "CLEC10A" "ISG15" "ISG15"
#> [3,] "GZMB" "FCER1A" "ENHO" "MX1" "ANXA1"
#> [4,] "SPON2" "KLRG1" "SERPINF1" "GZMK" "PTTG1"
#> [5,] "VIM" "CLEC10A" "CD1C" "IFI6" "GBP1"
#> [6,] "S100A4" "SELL" "GZMK" "OASL" "LTB"
#> [7,] "AKR1C3" "LYAR" "RBP7" "TNFSF10" "MX1"
#> [8,] "B2M" "NKG7" "CLIC2" "ACTG1" "PCNA"
#> [9,] "S100A10" "AKR1C3" "RP6-91H8.3" "APOBEC3B" "HOPX"
#> [10,] "S100B" "IGFBP7" "S100A12" "LYAR" "CD27"
topPCAgenes %>%
as.vector() %>%
intersect(c(s.genes, g2m.genes)) # see how many S and G2M genes intersect with that list
#> [1] "PCNA"
# or just check top variable genes
# if needed, regress out cell cycle effects
sobj_l <- ScaleData(sobj_l,
vars.to.regress = c("nCount_RNA", "percent_mito", "S.Score", "G2M.Score"))
# alternatively, leave the difference of cycling vs noncycling in place, only regress out phases in actively cycling cells
sobj_l$CC.Difference <- sobj_l$S.Score - sobj_l$G2M.Score
sobj_l <- ScaleData(sobj_l,
vars.to.regress = c("nCount_RNA",
"percent_mito",
"CC.Difference"))
find all markers for each cluster
FindAllMarkers compares cells in each cluster to all other cells in the dataset. Typically, focus is given to genes upregulated in each cluster, ie markers.
For more control in the comparisons, use FindMarkers. Positive average log (natural) fold change represents higher expression of the gene in cells of ident.1. pct1 and 2 are percent detected in each ident/population.
FindMarkers defaults to the current active ident. To use other value groups, set idents to the intended column, or use the group.by argument.
# compare control vs treated in only cluster 1
markers_df3 <- FindMarkers(sobj_clusters,
assay = "RNA",
slot = "data",
subset.ident = "1", # <- if needed, subset on current ident first, then switch idents
group.by = "orig.ident", # <- grouping cells by this metadata column
ident.1 = "control",
ident.2 = "treated")
For greater control, assign new idents or new columns in metadata, and customize DE analysis pairs.
# set new idents for more customized comparisons
Idents(sobj_clusters,
WhichCells(object = sobj_clusters,
idents = "1",
expression = ZFP36 >= 1, # would usually do this for transgenes etc
slot = 'data')) <- 'ZFP36.pos'
Idents(sobj_clusters,
WhichCells(object = sobj_clusters,
idents = "1",
expression = ZFP36 < 1,
slot = 'data')) <- 'ZFP36.neg'
Idents(sobj_clusters) %>% head() # note that with this method, some cell idents might not be changed
#> AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACGCTGACCAGT-1 AAACGCTGGTTCTT-1
#> 0 2 3 3
#> AAAGTTTGATCACG-1 AAATCATGACCACA-1
#> 2 5
#> Levels: ZFP36.neg ZFP36.pos 0 2 3 4 5 6 7
markers_df4 <- FindMarkers(sobj_clusters,
ident.1 = "ZFP36.pos",
ident.2 = "ZFP36.neg")
# or something like:
# Idents(sobj_clusters,
# WhichCells(object = sobj_clusters,
# idents = "1",
# expression = sizeFactors >= 1, # would usually do this for transgenes etc
# slot = 'data')) <- 'cluster1_large'
Genes of interest can then be visualized as violin plots or feature plots.
Idents(sobj_clusters) <- sobj_clusters@meta.data$clusters # plots are grouped by active ident
VlnPlot(sobj_clusters, "LYZ")
VlnPlot(sobj_clusters, c("ZFP36", "ZFP36L2")) # can be a vector of gene names
FeaturePlot(sobj_clusters, "LYZ")
FeaturePlot(sobj_clusters, c("ZFP36", "ZFP36L2")) # can be a vector of gene names
FeaturePlot(sobj_clusters, "ZFP36", split.by = "orig.ident") # <- split into panels based on metadata column
cluster identities
Without venturing into the realm of philosphical debates on what a “cell type” constitutes, standard pratice is to use certain gene expression features to classify cells. This is often done manually, by visual inspection of key genes. Automated approaches that utilize a broader range of features are currently being developed.
using clustifyr, and transcriptome profiles (from other scRNAseq, bulk RNAseq, or microarray), using ranked correlation of highly variable genes (scmap is another similar package)
clustifyr also takes seurat objects as input directly, finds various needed data, and output another object with identities assigned.
ref <- seurat_ref(pan_celseq2,
cluster_col = "celltype")
res <- clustify(input = pan_smartseq2,
ref_mat = ref,
cluster_col = "seurat_clusters",
seurat_out = TRUE)
res
#> An object of class Seurat
#> 34363 features across 1443 samples within 1 assay
#> Active assay: RNA (34363 features)
#> 2 dimensional reductions calculated: pca, umap
# look at UMAP
DimPlot(res, label = TRUE, group.by = "type") + NoLegend() # saved in "type" metadata column
Also see more tutorials at clustifyr, and prebuilt references from large scRNAseq/bulk RNAseq/microarray datasets at package clustifyrdata.
cell type composition
Insight into different samples can be gained from the proportion of cells that fall into each cell type. Unfortunately, no dedicated tools are available for statistical testing.