This file contains the meta-analysis of the mouseAtlas.rds generated by combining reference matrices from a variety of scRNA-seq datasets from across NCBI GEO and Tabula Muris.
First, the necessary libraries will be loaded in. They are dplyr (standard dataframe library), Seurat (create Seurat object), patchwork (ggplot library), clustifyr (clustifyr functions), tidyverse (collection of R data science packages), here (reproducibility of results), and clustree (visualizing clustering in hierarchical structure)
library(dplyr)
library(Seurat)
library(patchwork)
library(clustifyr)
library(tidyverse)
library(here)
library(clustree)
The mouse atlas will now be loaded into the local R data section. This will be accomplished with the here library establishing the root of the project and then loading in the mouseAtlas.rds file into the mouseAtlas matrix object. A Seurat object will then be created in order to perform single cell analysis on the mouse atlas.
proj_dir <- here()
mouseAtlas <- readRDS(file.path(proj_dir, "atlas", "musMusculus", "mouseAtlas.rds"))
mouseMetaAnalysis <- CreateSeuratObject(counts = mouseAtlas, project = "Mouse-Meta-Analysis", min.cells = 0, min.features = 0)
gc()
## used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
## Ncells 6334225 338.3 12265532 655.1 NA 9563291 510.8
## Vcells 27125191 207.0 47007099 358.7 204800 37460299 285.8
#Normalize Data
mouseMetaAnalysis <- NormalizeData(mouseMetaAnalysis, normalization.method = "LogNormalize", scale.factor = 10000)
Following the creation of the Seurat object, the standard workflow for annotating cell types will be followed. First we will filter out cells that do not meet any of the standard QC metrics, data normalization and scaling. First we will filter out genes that contain mitochondrial contamination using a percentage feature set and scanning for the ^mt-
pattern. All of the lines following help to visualize the QC metrics we are filtering out by (nCounts, nFeatures, percent.mt)
#Preprocessing workflow
mouseMetaAnalysis[["percent.mt"]] <- PercentageFeatureSet(mouseMetaAnalysis, pattern = "^mt-")
VlnPlot(mouseMetaAnalysis, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
plot1 <- FeatureScatter(mouseMetaAnalysis, feature1 = "nCount_RNA", feature2 = "percent.mt")
plot2 <- FeatureScatter(mouseMetaAnalysis, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
plot1 + plot2
In order to run principal component analysis on the data, cells with high gene expression variability compared to the rest of the data will need to be determined. This will help highlight biological signal of cells in further downstream analysis with PCA and uniform manifold approximation and projection (UMAP). The highly variable genes can also be plotted on a variable feature plot showing standard variance against average gene expression.
#Find Variable Features for PCA
mouseMetaAnalysis <- FindVariableFeatures(mouseMetaAnalysis, selection.method = "mean.var.plot", nfeatures = 2000)
top10 <- head(VariableFeatures(mouseMetaAnalysis), 10)
plot1 <- VariableFeaturePlot(mouseMetaAnalysis)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot1 + plot2
Once the variable features have been highlighted, the data can now be reduced into two dimensions from a large number of dimensions through principal component analysis. First, the data must be “scaled” in which a linear transformation shifts the expression of each gene so that the mean expression across cells is zero and that the variance across each cell is one. This should help highly expressed genes not to dominate the data.
Next, PCA will occur which will scale the data down into two dimensions. The results can be visualized through heat maps and scatter plots.
all.genes <- rownames(mouseMetaAnalysis)
mouseMetaAnalysis <- ScaleData(mouseMetaAnalysis, features = all.genes)
mouseMetaAnalysis <- RunPCA(mouseMetaAnalysis,
features = all.genes,
npcs = ncol(mouseMetaAnalysis) - 1)
print(mouseMetaAnalysis[["pca"]], dims = 1:5, nfeatures = 5)
## PC_ 1
## Positive: Rufy2, Tcea2, Syp, Rundc3a, Mapk10
## Negative: Tspo, Gng5, H2-K1, Vamp8, Zfp36
## PC_ 2
## Positive: F12, F3, F10, F8, Bc1
## Negative: Psma3, Hnrnpk, Pcbp1, Hnrnpa2b1, Srsf3
## PC_ 3
## Positive: Coro1a, Fam49b, Arhgap15, Arhgap30, Rassf5
## Negative: Ryk, Fermt2, Arhgef12, Pard3, Yap1
## PC_ 4
## Positive: Smim11, Szrd1, Cox20, Oser1, Gm6563
## Negative: Ahcy, Afmid, Timm23, Eif4ebp3, Ube2v1
## PC_ 5
## Positive: Ptrhd1, Akr1b3, Msrb1, Rp2, Asl
## Negative: Zbed6, B3gnt2, Cpne1, Styx, Zc3h11a
VizDimLoadings(mouseMetaAnalysis, dims = 1:2, reduction = "pca")
DimPlot(mouseMetaAnalysis, reduction = "pca")
DimHeatmap(mouseMetaAnalysis, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(mouseMetaAnalysis, dims = 1:15, cells = 500, balanced = TRUE)
Following PCA, the number of PCs or clusters that are needed to fully represent the data on a UMAP need to be determined. Many scRNA-seq datasets contain technical noise in the various clusters and the number of clusters needed to represent the dataset must be determined. In this analysis, jack straw plots and a slightly more heuristic elbow plot will be used to determine the accurate number of clusters for UMAP analysis.
#Determine dimensionality
mouseMetaAnalysis <- JackStraw(mouseMetaAnalysis, num.replicate = 100)
mouseMetaAnalysis <- ScoreJackStraw(mouseMetaAnalysis, dims = 1:20)
JackStrawPlot(mouseMetaAnalysis, dims = 1:15)
ElbowPlot(mouseMetaAnalysis)
Based on the jack straw and elbow plots, the number of clusters to accurately represent the data is around 20 to 30 clusters. We will generate clusters with a resolution of 5.0 to accomplish this. It is important to note that changing the resolution value will drastically change the number of clusters produced by the FindClusters()
method.
#Clustering
mouseMetaAnalysis <- FindNeighbors(mouseMetaAnalysis, dims = 1:10)
mouseMetaAnalysis <- FindClusters(mouseMetaAnalysis, resolution = 5.0, verbose = FALSE)
head(Idents(mouseMetaAnalysis), 5)
## Basal (GSE113049)
## 14
## Ciliated (GSE113049)
## 14
## Club (GSE113049)
## 14
## Endothelial/Fibroblast (GSE113049)
## 14
## Injured AEC2: Cell Cycle Arrest (GSE113049)
## 14
## Levels: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
#Create unannotated UMAP
mouseMetaAnalysis <- RunUMAP(mouseMetaAnalysis, dims = 1:10)
DimPlot(mouseMetaAnalysis, reduction = "umap")
In order to determine which genes are being expressed in the UMAP, biomarkers must be clustered through the use of the FindAllMarkers()
function. This function automates the process across all clusters and helps to expedite the biomarker selection process. A number of visualization steps can be taken here, but in this analysis, a heat map of the top 10 clusters with each expression of each biomarker will be shown.
#Cluster biomarkers
cluster1.markers <- FindMarkers(mouseMetaAnalysis, ident.1 = 1, min.pct = 0.25)
head(cluster1.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## Arx 1.454041e-20 0.7910058 1.000 0.329 4.509999e-16
## AF529169 1.825309e-20 0.2818451 1.000 0.298 5.661560e-16
## Rxfp3 1.583074e-19 0.3851036 0.923 0.217 4.910220e-15
## Tunar 1.981040e-19 0.3247511 1.000 0.315 6.144593e-15
## Cntnap3 8.151096e-18 0.3930815 1.000 0.278 2.528225e-13
cluster5.markers <- FindMarkers(mouseMetaAnalysis, ident.1 = 5, ident.2 = c(0,3), min.pct = 0.25)
head(cluster5.markers, n = 5)
## p_val avg_logFC pct.1 pct.2 p_val_adj
## Scoc 2.848419e-11 0.4154008 1 1.000 8.834940e-07
## Scg3 3.035629e-11 0.6699949 1 0.765 9.415610e-07
## Fam45a 3.089705e-11 0.3495785 1 0.961 9.583337e-07
## Ngrn 3.090779e-11 0.2689584 1 0.980 9.586670e-07
## Nrsn2 3.443805e-11 0.8891658 1 0.549 1.068165e-06
mouseMetaAnalysis.markers <- FindAllMarkers(mouseMetaAnalysis, only.pos = TRUE, min.pct = 0.1, logfc.threshold = 0.25)
mouseMetaAnalysis.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
## # A tibble: 32 x 7
## # Groups: cluster [16]
## p_val avg_logFC pct.1 pct.2 p_val_adj cluster gene
## <dbl> <dbl> <dbl> <dbl> <dbl> <fct> <chr>
## 1 1.53e-17 1.05 1 0.44 4.75e-13 0 Neurod6
## 2 2.11e-15 1.06 1 0.536 6.55e-11 0 Slc17a7
## 3 1.85e-15 1.45 1 0.468 5.74e-11 1 Sst
## 4 2.72e-15 1.19 1 0.603 8.44e-11 1 Npy
## 5 2.04e-30 1.62 1 0.112 6.33e-26 2 mt-Atp6
## 6 5.48e-24 1.61 1 0.258 1.70e-19 2 Spi1
## 7 2.21e-34 1.26 1 0.101 6.87e-30 3 Plpp3
## 8 3.21e-25 1.16 1 0.121 9.97e-21 3 Gm42418
## 9 3.35e-15 1.72 1 0.695 1.04e-10 4 Ptprcap
## 10 1.14e-14 1.59 1 0.591 3.55e-10 4 Cd79a
## # … with 22 more rows
cluster1.markers <- FindMarkers(mouseMetaAnalysis, ident.1 = 0, logfc.threshold = 0.25, test.use = "roc", only.pos = TRUE)
top10 <- mouseMetaAnalysis.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC) #Create top 10 markers for each cluster
DoHeatmap(mouseMetaAnalysis, features = top10$gene) + NoLegend() #Create heat map of biomarkers for all top 17 clusters
Annotating the previously generated UMAP can clarify cell types present in the UMAP and allows for a better analysis of which cells are located where on the UMAP. Here, the column names in the mouseAtlas.rds
file are the cell types so we can rename the identities of the cells on the UMAP with the correct cell names. HoverLocator
and a UMAP grouped by study can also be helpful to highlight where each cell in the atlas originates.
#Assign cell types/Annotate UMAP
new.cluster.ids <- colnames(mouseAtlas)
names(new.cluster.ids) <- levels(mouseMetaAnalysis)
mouseMetaAnalysis <- RenameIdents(mouseMetaAnalysis, new.cluster.ids)
AnnotatedUMAP <- DimPlot(mouseMetaAnalysis, reduction = "umap", label = FALSE, pt.size = 0.5)
HoverLocator(plot = AnnotatedUMAP, information = FetchData(mouseMetaAnalysis, vars = c("seurat_clusters")))
mouseMetaAnalysis@meta.data$study <- str_remove(rownames(mouseMetaAnalysis@meta.data), ".+\\(") %>% str_remove("\\)")
DimPlot(mouseMetaAnalysis, reduction = "umap", group.by = "study")
Another useful analysis to determine the validity of the cell clusters is to visualize the hierarchy of the generated clusters. This could be helpful in further downstream analysis when similar cell types are added to the atlas. Here, the clustree library is used to help visualize the hierarchy of clusters.
mouseMetaAnalysis <- FindClusters(mouseMetaAnalysis, resolution = 1.0, verbose = FALSE)
mouseMetaAnalysis <- FindClusters(mouseMetaAnalysis, resolution = 3.0, verbose = FALSE)
mouseMetaAnalysis <- FindClusters(mouseMetaAnalysis, resolution = 5.0, verbose = FALSE)
mouseMetaAnalysis@meta.data$RNA_snn_res.1000 <- rownames(mouseMetaAnalysis@meta.data)
g <- clustree(mouseMetaAnalysis,
layout = "sugiyama",
use_core_edges = FALSE,
node_text_size = 2,
node_alpha = 0,
edge_width = 1) +
scale_edge_alpha(range = c(0.05,0.05)) + # otherwise edges cover everything up
geom_text(aes(x = 0, y = -10, label = "mouse", size = 2)) # just to make some room so labels aren't cut off
# move the single cell layer of nodes down for more space
gedit <- g$data[, "RNA_snn_res."] == 1000
g$data[gedit, "y"] <- -5
# rotate single cell layer texts
g$layers[[3]]$aes_params$angle <- 90
g