The goal of this lecture is to give an overview of scRNA seq, quality control, filtering, and briefly touch on PCA
RC1 South room9101
office hours on Thursday afternoons
For some, or many of you, this might be your first time programming. Since you are taking this course, I probably don’t have to spend a ton of time convincing you of the benefits of learning how to code. But I want to briefly touch on just a few of the benefits before proceeding with the lecture material.
When it comes to bioinformatics, there are really 3 main languages/frameworks you will consistantly work with. These are: bash, R, and python.
In this course, we will almost exclusively be working with R. R is a great first language to learn, because it is primarily used for data cleaning, analysis, visualization, and statistics. It is also a great place to start, because many bioinformatics packages are written in R. Finally, if you have done a lot of work in Excel (as most scientists have), the fundamental concepts and functions in R, should be fairly*** straightforward to pick up.
***with some practice
One misconception I had when I started programming was that programmers have the knowledge to carry out most analyses prior to writing code. Hollywood has created this archetype of the ‘whiz-kid’ or ‘misunderstood genius hacker’ who write code at lightning speed and who just know what code needs to be written and exactly how to write it. This is not how it works in the real world. Most programmers I know spend far more time googling and thinking than writing code.
You are going to get stuck a lot.
You are going to get a lot of error messages.
You are going to make mistakes.
Stack overflow
RBI
Online tutorials
Biostars and SeqAnswers
Papers
… Now that we’ve covered an overview of programming tips, let’s switch gears and switch gears to the reason you are all here
RNA seq is similar to a fruit smoothie. The individual fruits (strawberries, bananas, apples, oranges) that would make up the smoothie would be analogous to the different cell types in a population. You blend up the fruits into the smoothie (lyse cells, extract RNA, and make libraries), and then you drink it (run it on a sequencer).
But when you drink a smoothie, its harder to tell which flavors came from which fruits than if you were to eat each fruit individually. Especially if you had something unexpected, like kale***, in the smoothie. This is analogous to bulk RNA-sequencing. In bulk RNA seq, you lose all information about the cell of origin for each read. If you wanted to know the RNA expression in a certain cell type, you had to try to purify that cell type from the smoothie and sequence it. Thereby losing all information about other cell types in your samples.
***yuck
Enter single cell seq. Single cell sequencing technologies allow a researcher to tag individual cells with a unique sequence, called a barcode, that allows you to know which cell a read originates from. Therefore, you can computationally parse which reads came.
Through microfluics wizardy, cell gets loaded into a droplet with a bead with all the DNA bits needed to do the transcript barcoding
In theory, each drop will have (1) a cell and (2) a DNA bit that will do the barcoding. This DNA bit contains a unique barcode, a UMI (for correcting PCR amplification), and an oligodT stretch (because mRNA).
Through some clever (simple) math: (1) a small number of cells are loaded into each well a multi-well plate and barcoded. Each well has a different barcode (2) Then they are pooled back together (3) The process is repeated with a plate containing another set of barcodes. The probability that 2 cells pick up the exact same combintation of barcodes becomes vanishingly small.
Combines the two methods and overloads the droplets
“There are no solutions, only trade-offs”. And in the case of bulk RNA seq vs single-cell RNA seq, this is certainly true… for now. The information you gain about the cellular origin read from comes at a price:
only a small portion of the mRNAs from each cell are captured
low number of detectable genes, might not detect low expressing genes at all
little info outside of gene counts
Almost all sequencing experiments fail or succeed at the design step.
Programming cannot fix bad data. Crap in crap out.
cellranger performs alignment, filtering, UMI counting, clustering, and gene expression analysis.
Alternatives such as Alevin, STARsolo, Kallisto
# in terminal
cellranger count --id=123 \
--transcriptome=/refdata-cellranger-GRCh38-3.0.0 \
# build new transcriptome if you have GFP/RFP/transgene
--fastqs=/home/runs/HAT7ADXX/outs/fastq_path \
# a list of fastqs, a folder, or pass table via csv file
--sample=mysample
# will need additional arguments for feature barcoding
output folder structure
KO_1_CDNA|-- KO_1_CDNA.mri.tgz
|-- SC_RNA_COUNTER_CS
| |-- CLOUPE_PREPROCESS
| |-- EXPAND_SAMPLE_DEF
| |-- SC_RNA_COUNTER
| `-- fork0
|-- _cmdline
|-- _filelist
|-- _finalstate
|-- _invocation
|-- _jobmode
|-- _log
|-- _mrosource
|-- _perf
|-- _sitecheck
|-- _tags
|-- _timestamp
|-- _uuid
|-- _vdrkill
|-- _versions
`-- outs
|-- analysis
|-- cloupe.cloupe # cloupe file for browser view
|-- filtered_feature_bc_matrix # use this fold for seurat
|-- filtered_feature_bc_matrix.h5
|-- metrics_summary.csv
|-- molecule_info.h5
|-- possorted_genome_bam.bam
|-- possorted_genome_bam.bam.bai
|-- raw_feature_bc_matrix # or this for seurat
|-- raw_feature_bc_matrix.h5
`-- web_summary.html # qc summary
~ 5-8 hours per sample on biochem department cluster “Bodhi”
theoretically can be ran locally on linux, but will require at least 32GB of RAM
other campus options include Rosalind, AWS https://github.com/rnabioco/cellrangerAWS
cellranger output html and loupe files
The total UMI (unique molecular identifier - represent each transcript) of a cell barcode is used to rank the barcodes determine the UMI threshold for signal vs noise. A plot is generated for cellranger html output, but will also be useful in other situations like hashing and CITE-seq.
for a standard 10x scRNAseq run:
expected range of x axis (barcode) : ~ 10^5 (if using ggplot to visualize, filter)
expected inflection point of x axis (cell number): ~ # of loaded cells / 2
expected range of y axis (UMI_counts) : ~ 10^4
expected inflection point of y axis (cutoff UMI count): ~ 1000
library(tidyverse)
# use "raw" instead of "filtered" cellranger output folder
data_url = "https://scrnaseq-workshop.s3-us-west-2.amazonaws.com"
m1 <- readRDS(url(file.path(data_url, "raw_matrix.rds")))
# all genes x all barcodes
counts <- Matrix::colSums(m1) # calculate total UMI read number for each cell barcode
countdf <- as.data.frame(counts) %>%
as_tibble(rownames = "barcode") %>%
filter(counts >= 2) %>% # throw out cell barcodes with 1 or less UMI, this is mainly for time purposes
arrange(desc(counts)) %>% # arrange by descending order
mutate(rank = 1:n()) # rank
head(countdf) # barcodes now ranked by UMI counts
#> # A tibble: 6 × 3
#> barcode counts rank
#> <chr> <dbl> <int>
#> 1 GCAGCCACATACCGTA 40745 1
#> 2 ATGGAGGGTGGTAACG 20916 2
#> 3 GTAACCATCGCTTGAA 19958 3
#> 4 TTCATGTGTCGTGTTA 19765 4
#> 5 GAATAGACATCCTGTC 18494 5
#> 6 GTCTCACGTTGGCCGT 17828 6
ggplot(countdf, aes(x = rank, y = counts)) +
geom_point() +
labs(x = "barcodes", y = "UMI_counts") +
theme_classic() +
scale_x_log10() +
scale_y_log10()
Single-cell RNA-seq counts are usually stored as a sparse matrix due to the high percentage of zeros. In a sparse matrix zeros are removed and only non-zero values are stored, which saves memory and speeds up operations.
The Read10X
function can be used with the output directory generated by 10X Cell Ranger to load the counts data as a sparse matrix.
# Import matrix of counts
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")
pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
#min.cells- Include features detected in at least this many cells. Will subset the counts matrix as well. To reintroduce excluded features, create a new object with a lower cutoff.
#min.features- Include cells where at least this many features are detected.
head(pbmc@meta.data, 5)
#> orig.ident nCount_RNA nFeature_RNA
#> AAACATACAACCAC-1 pbmc3k 2419 779
#> AAACATTGAGCTAC-1 pbmc3k 4903 1352
#> AAACATTGATCAGC-1 pbmc3k 3147 1129
#> AAACCGTGCTTCCG-1 pbmc3k 2639 960
#> AAACCGTGTATGCG-1 pbmc3k 980 521
# lets look at come of the count data. It is called a sparse matrix because zeros are represented as '.' saves memory
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
#> 3 x 30 sparse Matrix of class "dgCMatrix"
#>
#> CD3D 4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2 3 . . . . . 3 4 1 5
#> TCL1A . . . . . . . . 1 . . . . . . . . . . . . 1 . . . . . . . .
#> MS4A1 . 6 . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .
At the top level, the Seurat object serves as a collection of Assay and DimReduc objects, representing expression data and dimensionality reductions of the expression data, respectively. The Assay objects are designed to hold expression data of a single type, such as RNA-seq gene expression, CITE-seq ADTs, cell hashtags, or imputed gene values. DimReduc objects represent transformations of the data contained within the Assay object(s) via various dimensional reduction techniques such as PCA.
See : https://github.com/satijalab/seurat/wiki/Seurat#slots for a more detailed explanation of the Seurat object
pbmc
#> An object of class Seurat
#> 13714 features across 2700 samples within 1 assay
#> Active assay: RNA (13714 features, 0 variable features)
# dim provides both nrow and ncol at the same time
dim(x = pbmc)
#> [1] 13714 2700
#> [1] "AL627309.1" "AP006222.2" "RP11-206L10.2" "RP11-206L10.9"
#> [5] "LINC00115" "NOC2L"
#> [1] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1"
#> [4] "AAACCGTGCTTCCG-1" "AAACCGTGTATGCG-1" "AAACGCACTGGTAC-1"
# A vector of names of associated objects can be had with the names function.
names(x = pbmc)
#> [1] "RNA"
# IThese can be passed to the double [[ extract operator to pull them from the Seurat object
pbmc[['RNA']]
#> Assay data with 13714 features for 2700 cells
#> First 10 features:
#> AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9,
#> LINC00115, NOC2L, KLHL17, PLEKHN1, RP11-54O7.17, HES4
# Looking at the metadata. We will add another column to this later when we look at the percentage of reads per cell mapping to mitochondrial transcripts
head(pbmc@meta.data)
#> orig.ident nCount_RNA nFeature_RNA
#> AAACATACAACCAC-1 pbmc3k 2419 779
#> AAACATTGAGCTAC-1 pbmc3k 4903 1352
#> AAACATTGATCAGC-1 pbmc3k 3147 1129
#> AAACCGTGCTTCCG-1 pbmc3k 2639 960
#> AAACCGTGTATGCG-1 pbmc3k 980 521
#> AAACGCACTGGTAC-1 pbmc3k 2163 781
Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include
The number of unique genes detected in each cell.
Low-quality cells or empty droplets will often have very few genes.
Cell doublets or multiplets may exhibit an aberrantly high gene count.
Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)
The percentage of reads that map to the mitochondrial genome is indicative of dying cells.
Low-quality / dying cells often exhibit extensive mitochondrial contamination.
We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features We use the set of all genes starting with MT- as a set of mitochondrial genes
#We will add a column to the metadata calculating the percentage of genes mapping to mitochondrial transcripts
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
#We can now see that the metadata now includes the percentage of mitochondrial genes
head(pbmc@meta.data, 5)
#> orig.ident nCount_RNA nFeature_RNA percent.mt
#> AAACATACAACCAC-1 pbmc3k 2419 779 3.0177759
#> AAACATTGAGCTAC-1 pbmc3k 4903 1352 3.7935958
#> AAACATTGATCAGC-1 pbmc3k 3147 1129 0.8897363
#> AAACCGTGCTTCCG-1 pbmc3k 2639 960 1.7430845
#> AAACCGTGTATGCG-1 pbmc3k 980 521 1.2244898
#We can also peak into the metadata by examining just a particular column. Similar to dplyr functions. Try erasing 'percent mito' and tab completing
head(x = pbmc$percent.mt)
#> AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1
#> 3.0177759 3.7935958 0.8897363 1.7430845
#> AAACCGTGTATGCG-1 AAACGCACTGGTAC-1
#> 1.2244898 1.6643551
We want to remove low quality cells for our downstream analyses. Using the plot above, we can draw some cutoffs by eye. Here we will filter out cells by getting rid of cells that have unique feature counts over 2,500 or less than 200. We also filter cells that have >5% mitochondrial counts
pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
#Note that some cells were filtered out- We started with 2700
dim(x = pbmc)
#> [1] 13714 2638
After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[[“RNA”]]@data.
Think of this step as normalizing expression of each transcripts to the read depth for all genes in that cell. The log transofrmation will compress lowly and highly expressed genes to a more similar scale
pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
head(pbmc[["RNA"]]@data,100)
#> 100 x 2638 sparse Matrix of class "dgCMatrix"
#>
#> AL627309.1 . . . . . . .
#> AP006222.2 . . . . . . .
#> RP11-206L10.2 . . . . . . .
#> RP11-206L10.9 . . . . . . .
#> LINC00115 . . . . . . .
#> NOC2L . . . . . . .
#> KLHL17 . . . . . . .
#> PLEKHN1 . . . . . . .
#> RP11-54O7.17 . . . . . . .
#> HES4 . . . . . . .
#> RP11-54O7.11 . . . . . . .
#> ISG15 . . 1.429744 3.558310 . 1.726902 .
#> AGRN . . . . . . .
#> C1orf159 . . . . . . .
#> TNFRSF18 . 1.625141 . . . . .
#> TNFRSF4 . . . . . . .
#> SDF4 . . 1.429744 . . . .
#> B3GALT6 . . . . . . 1.722356
#> FAM132A . . . . . . .
#> UBE2J2 . . . . . . .
#> ACAP3 . . . . . . .
#> PUSL1 . . . . . . 1.722356
#> CPSF3L . . . . . . .
#> GLTPD1 . . . . . . .
#> DVL1 . . . . . . .
#> MXRA8 . . . . . . .
#> AURKAIP1 . 1.111715 . 1.566387 . . .
#> CCNL2 . . . . . . .
#> RP4-758J18.2 . . . . . . .
#> MRPL20 1.635873 . 1.429744 . . . .
#> ATAD3C . . . . . . .
#> ATAD3B . . . . . . .
#> ATAD3A . . . . . . .
#> SSU72 . 1.111715 . 2.515108 . . .
#> AL645728.1 . . . . . . .
#> C1orf233 . . . . . . .
#> RP11-345P4.9 . . . . . . .
#> MIB2 . 1.625141 . . . . 1.722356
#> MMP23B . . . . . . .
#> CDK11B . . . . . . .
#> SLC35E2B . . . . . . .
#> CDK11A . . . . . . .
#> SLC35E2 . . . . . . .
#> NADK . . . 1.566387 . . .
#> GNB1 . . . . . . .
#> RP1-140A9.1 . . . . . . .
#> TMEM52 . . . . . . .
#> PRKCZ . . . . . . .
#> RP5-892K4.1 . . . . . . .
#> C1orf86 . . . . . . .
#> AL590822.2 . . . . . . .
#> SKI . . . . . . .
#> RER1 . 1.111715 1.429744 1.566387 . . .
#> PEX10 . . . . . . .
#> PLCH2 . . . . . . .
#> PANK4 . . 1.429744 . . 1.726902 .
#> RP3-395M20.12 . 1.111715 . . . . .
#> TNFRSF14 . . . 1.566387 . 1.726902 .
#> RP3-395M20.9 . . . . . . .
#> FAM213B . . . . . . .
#> MMEL1 . . . . . . .
#> TTC34 . . . . . . .
#> MEGF6 . . . . . . .
#> TPRG1L . . . . . . .
#> WRAP73 . . . . . . .
#> TP73-AS1 . . . . . . .
#> SMIM1 . . . . . . .
#> LRRC47 . . 1.429744 . . . .
#> CEP104 . . . . . . .
#> DFFB . . . . . . .
#> C1orf174 . . . . . . .
#> NPHP4 . . . . . . .
#> KCNAB2 . . . . . . .
#> RPL22 1.635873 1.962726 1.429744 2.149274 . . .
#> RNF207 . . . . . . .
#> ICMT . . . . . . .
#> GPR153 . . . . . . .
#> ACOT7 . . . . . . 1.722356
#> RP1-202O8.3 . . . . . . .
#> ESPN . . . . . . .
#> TNFRSF25 2.226555 . . . . . .
#> PLEKHG5 . . . . . . .
#> NOL9 . . . . . . .
#> ZBTB48 . . . . . . .
#> KLHL21 . . . . . . .
#> PHF13 . . . . . . .
#> THAP3 . . . . . . .
#> DNAJC11 . . . . . . .
#> RP11-312B8.1 . . . . . . .
#> CAMTA1 . 1.111715 . . . . .
#> VAMP3 . . . . . . .
#> PER3 . . . . . . .
#> UTS2 . . . . . . .
#> TNFRSF9 . . . . . . .
#> PARK7 . . 1.995416 . . . .
#> SLC45A1 . . . . . . .
#> RERE . . . . . . 1.722356
#> RP5-1115A15.1 . . . . . . .
#> ENO1 . 1.625141 1.995416 2.149274 . . 1.722356
#> ENO1-AS1 . . . . . . .
#>
#> AL627309.1 . . . . . .
#> AP006222.2 . . . . . .
#> RP11-206L10.2 . . . . . .
#> RP11-206L10.9 . . . . . .
#> LINC00115 . . . . . .
#> NOC2L . . . . 1.646272 .
#> KLHL17 . . . . . .
#> PLEKHN1 . . . . . .
#> RP11-54O7.17 . . . . . .
#> HES4 . . . . . .
#> RP11-54O7.11 . . . . . .
#> ISG15 . . 3.339271 . . 1.638876
#> AGRN . . . . . .
#> C1orf159 . . . . . .
#> TNFRSF18 . . . . . .
#> TNFRSF4 . 2.179642 . . . .
#> SDF4 . . . . . .
#> B3GALT6 . . . . . .
#> FAM132A . . . . . .
#> UBE2J2 . 2.179642 . 1.268336 . .
#> ACAP3 . . . . . .
#> PUSL1 . . . . . .
#> CPSF3L . . . 1.268336 . .
#> GLTPD1 . . . . . .
#> DVL1 . . . . . .
#> MXRA8 . . . . . .
#> AURKAIP1 1.690977 . . 1.809904 . 2.229881
#> CCNL2 . . . . . .
#> RP4-758J18.2 . . . 1.268336 . .
#> MRPL20 1.690977 . . 1.809904 . .
#> ATAD3C . . . . . .
#> ATAD3B . . . . 1.646272 .
#> ATAD3A . . . . . .
#> SSU72 . . . . . .
#> AL645728.1 . . . . . .
#> C1orf233 . . . . . .
#> RP11-345P4.9 . . . . . .
#> MIB2 . . . . . .
#> MMP23B . . . . . .
#> CDK11B . . . . . .
#> SLC35E2B . . . . . .
#> CDK11A . . . . . .
#> SLC35E2 . . . . . .
#> NADK . . . . . .
#> GNB1 . 2.179642 . . . .
#> RP1-140A9.1 . . . . . .
#> TMEM52 . . . . . .
#> PRKCZ . . . . . .
#> RP5-892K4.1 . . . . . .
#> C1orf86 . 2.179642 . . . .
#> AL590822.2 . . . . . .
#> SKI . . . . . .
#> RER1 . . . . 1.646272 .
#> PEX10 . . . . . .
#> PLCH2 . . . . . .
#> PANK4 . . . . . .
#> RP3-395M20.12 . . . . . .
#> TNFRSF14 . . . 1.268336 . .
#> RP3-395M20.9 . . . . . .
#> FAM213B . . . . . .
#> MMEL1 . . . . . .
#> TTC34 . . . . . .
#> MEGF6 . . . . . .
#> TPRG1L . . . . . .
#> WRAP73 . . . . . .
#> TP73-AS1 . . . . 1.646272 .
#> SMIM1 . . . . . .
#> LRRC47 . . . . . .
#> CEP104 . . . . . .
#> DFFB . . . . . .
#> C1orf174 . 2.179642 . . . .
#> NPHP4 . . . . . .
#> KCNAB2 . . . . . .
#> RPL22 1.690977 . . 2.159268 2.238069 2.229881
#> RNF207 . . . . . .
#> ICMT . . . . . .
#> GPR153 . . . . . .
#> ACOT7 . . . . . .
#> RP1-202O8.3 . . . . . .
#> ESPN . . . . . .
#> TNFRSF25 1.690977 . . . . .
#> PLEKHG5 . . . . . .
#> NOL9 . . . . . .
#> ZBTB48 . . . . . .
#> KLHL21 . . . . . .
#> PHF13 . . . . . .
#> THAP3 1.690977 . . . . .
#> DNAJC11 . . . . . .
#> RP11-312B8.1 . . . . . .
#> CAMTA1 . . . . . .
#> VAMP3 . . . . . .
#> PER3 . . . . . .
#> UTS2 . . . . . .
#> TNFRSF9 . . . . . .
#> PARK7 . . 2.309182 1.268336 . 2.229881
#> SLC45A1 . . . . . .
#> RERE . . . . . .
#> RP5-1115A15.1 . . . . . .
#> ENO1 . . . 1.268336 . .
#> ENO1-AS1 . . . . . .
#>
#> AL627309.1 . . . . . .
#> AP006222.2 . . . . . .
#> RP11-206L10.2 . . . . . .
#> RP11-206L10.9 . . . . . .
#> LINC00115 . . . . . .
#> NOC2L . . . . . .
#> KLHL17 . . . . . .
#> PLEKHN1 . . . . . .
#> RP11-54O7.17 . . . . . .
#> HES4 . . . 1.157353 . .
#> RP11-54O7.11 . . . . . .
#> ISG15 2.861362 . 3.267762 1.157353 . .
#> AGRN . . . . . .
#> C1orf159 . . . . . .
#> TNFRSF18 . . . . . .
#> TNFRSF4 . . . . 3.737279 .
#> SDF4 . . . . 1.485076 1.102225
#> B3GALT6 . . . . . .
#> FAM132A . . . . . .
#> UBE2J2 1.457932 . . . 1.485076 .
#> ACAP3 . 2.270898 . . . .
#> PUSL1 . . . . . .
#> CPSF3L . . . . . .
#> GLTPD1 . . . . . .
#> DVL1 . . . . . .
#> MXRA8 . . . . . .
#> AURKAIP1 1.457932 . . . . .
#> CCNL2 1.457932 . . 1.679524 1.485076 .
#> RP4-758J18.2 . . . . . .
#> MRPL20 1.457932 . . . 1.485076 .
#> ATAD3C . . . . . .
#> ATAD3B . . . . . 1.102225
#> ATAD3A . . . . . .
#> SSU72 1.457932 . . 2.020819 . 1.613772
#> AL645728.1 . . . . . .
#> C1orf233 . . . . . .
#> RP11-345P4.9 . . . . . .
#> MIB2 1.457932 . . . . .
#> MMP23B . . . . . .
#> CDK11B . . . . . .
#> SLC35E2B . . . . . .
#> CDK11A 1.457932 . . . . .
#> SLC35E2 . . . . . .
#> NADK 2.027376 . . . . .
#> GNB1 . . . 1.157353 . 1.102225
#> RP1-140A9.1 . . . . . .
#> TMEM52 . . . . . .
#> PRKCZ . . . . . .
#> RP5-892K4.1 . . . . . .
#> C1orf86 . . . . . .
#> AL590822.2 . . . . . .
#> SKI . . . . . .
#> RER1 1.457932 . . 1.679524 . .
#> PEX10 . . . . . .
#> PLCH2 . . . . . .
#> PANK4 . . . . . .
#> RP3-395M20.12 . . . . . .
#> TNFRSF14 . . . 1.157353 . .
#> RP3-395M20.9 . . . . . .
#> FAM213B . . . . . .
#> MMEL1 . . . . . .
#> TTC34 . . . . . .
#> MEGF6 . . . . . .
#> TPRG1L . . . . . .
#> WRAP73 . . . . . .
#> TP73-AS1 . . . . . .
#> SMIM1 . . . . . .
#> LRRC47 . . . . . .
#> CEP104 . . . . . .
#> DFFB . . . . . .
#> C1orf174 . . . . . 1.102225
#> NPHP4 . . . . . .
#> KCNAB2 . . . 1.157353 . 1.102225
#> RPL22 1.457932 2.270898 3.267762 2.274803 1.485076 2.569949
#> RNF207 . . . . . .
#> ICMT . . . . . .
#> GPR153 . . . . . .
#> ACOT7 . . . . . .
#> RP1-202O8.3 . . . . . .
#> ESPN . . . . . .
#> TNFRSF25 . . . . . .
#> PLEKHG5 . . . . . .
#> NOL9 . . . . . .
#> ZBTB48 . . . . . .
#> KLHL21 . . . . . .
#> PHF13 . . . . . .
#> THAP3 1.457932 . . . . .
#> DNAJC11 . . . . . .
#> RP11-312B8.1 . . . . . .
#> CAMTA1 . . . . . .
#> VAMP3 1.457932 . . . . .
#> PER3 . . . . . .
#> UTS2 . . . . . .
#> TNFRSF9 . . . . . .
#> PARK7 . . . 2.020819 1.485076 1.950553
#> SLC45A1 . . . . . .
#> RERE . . . . . .
#> RP5-1115A15.1 . . . . . .
#> ENO1 1.457932 . . 2.645395 1.485076 2.202039
#> ENO1-AS1 . . . . . .
#>
#> AL627309.1 . . . . . .
#> AP006222.2 . . . . . .
#> RP11-206L10.2 . . . . . .
#> RP11-206L10.9 . . . . . .
#> LINC00115 . . . . . .
#> NOC2L . 1.398186 . . . .
#> KLHL17 . . . . . .
#> PLEKHN1 . . . . . .
#> RP11-54O7.17 . . . . . .
#> HES4 . . . . . .
#> RP11-54O7.11 . . . . . .
#> ISG15 . 1.398186 . 1.553327 1.670007 .
#> AGRN . . . . . .
#> C1orf159 . . . . . .
#> TNFRSF18 . . . . . .
#> TNFRSF4 . . . . . .
#> SDF4 . . . . 1.670007 .
#> B3GALT6 . . . . . .
#> FAM132A . . . . . .
#> UBE2J2 . . . . . .
#> ACAP3 . . . . . .
#> PUSL1 . . . . . .
#> CPSF3L . . . . . .
#> GLTPD1 . . . . . .
#> DVL1 . . . . . .
#> MXRA8 . . . . . .
#> AURKAIP1 . 1.398186 . 1.553327 . .
#> CCNL2 . . . . . .
#> RP4-758J18.2 . . . . . .
#> MRPL20 . . . 1.553327 . .
#> ATAD3C . . . . . .
#> ATAD3B . . . . . .
#> ATAD3A . . . . . .
#> SSU72 . . 2.309999 . . 2.089658
#> AL645728.1 . . . . . .
#> C1orf233 . . . . . .
#> RP11-345P4.9 . . . . . .
#> MIB2 . . . . . .
#> MMP23B . . . . . .
#> CDK11B . . . 1.553327 . .
#> SLC35E2B . . . . . .
#> CDK11A . . . 1.553327 . .
#> SLC35E2 . . . . . .
#> NADK . . . . . .
#> GNB1 . . . . 1.670007 .
#> RP1-140A9.1 . . . . . .
#> TMEM52 . . . . . .
#> PRKCZ . . . . . .
#> RP5-892K4.1 . . . . . .
#> C1orf86 . . . . . .
#> AL590822.2 . . . . . .
#> SKI . . . . . .
#> RER1 . . . . . .
#> PEX10 . . . . . .
#> PLCH2 . . . . . .
#> PANK4 . . . . . .
#> RP3-395M20.12 . . . . . .
#> TNFRSF14 2.184526 . . . 1.670007 .
#> RP3-395M20.9 . . . . . .
#> FAM213B . . . . . .
#> MMEL1 . . . . . .
#> TTC34 . . . . . .
#> MEGF6 . . . . . .
#> TPRG1L . . . . . .
#> WRAP73 . . . . . .
#> TP73-AS1 . . . . . .
#> SMIM1 . . . . . .
#> LRRC47 . . . . . .
#> CEP104 . . . . . .
#> DFFB . . . . . .
#> C1orf174 . . . . . 2.089658
#> NPHP4 . . . . . .
#> KCNAB2 . . . . . 2.718944
#> RPL22 3.482647 2.579565 . 1.553327 2.264302 2.089658
#> RNF207 . . . . . .
#> ICMT . . . . . .
#> GPR153 . . . . . .
#> ACOT7 . . . . . .
#> RP1-202O8.3 . . . . . .
#> ESPN . . . . . .
#> TNFRSF25 . . . . . .
#> PLEKHG5 . . . . . .
#> NOL9 . . . . . .
#> ZBTB48 . . . . . .
#> KLHL21 . . . . . .
#> PHF13 . . . . . .
#> THAP3 . . . . . .
#> DNAJC11 . . . . . .
#> RP11-312B8.1 . . . . . .
#> CAMTA1 . . . . . 4.533330
#> VAMP3 . . . . . .
#> PER3 . . . . . .
#> UTS2 . . . . . .
#> TNFRSF9 . . . . . .
#> PARK7 . 1.398186 . . . .
#> SLC45A1 . . . . . .
#> RERE . . . . . .
#> RP5-1115A15.1 . . . . . .
#> ENO1 . 1.959489 . . 1.670007 .
#> ENO1-AS1 . . . . . .
#>
#> AL627309.1 . . . ......
#> AP006222.2 . . . ......
#> RP11-206L10.2 . . . ......
#> RP11-206L10.9 . . . ......
#> LINC00115 . . . ......
#> NOC2L . . . ......
#> KLHL17 . . . ......
#> PLEKHN1 . . . ......
#> RP11-54O7.17 . 1.015884 . ......
#> HES4 . . . ......
#> RP11-54O7.11 . . . ......
#> ISG15 . 1.509310 . ......
#> AGRN . . . ......
#> C1orf159 . . . ......
#> TNFRSF18 . . . ......
#> TNFRSF4 . . . ......
#> SDF4 . . . ......
#> B3GALT6 . . . ......
#> FAM132A . . . ......
#> UBE2J2 . 1.015884 . ......
#> ACAP3 . . . ......
#> PUSL1 . . . ......
#> CPSF3L . . . ......
#> GLTPD1 . . . ......
#> DVL1 . . . ......
#> MXRA8 . . . ......
#> AURKAIP1 . 1.015884 . ......
#> CCNL2 . . . ......
#> RP4-758J18.2 . . . ......
#> MRPL20 . . . ......
#> ATAD3C . . . ......
#> ATAD3B . . . ......
#> ATAD3A . . 1.355669 ......
#> SSU72 . . . ......
#> AL645728.1 . . . ......
#> C1orf233 . 1.015884 . ......
#> RP11-345P4.9 . . . ......
#> MIB2 . . . ......
#> MMP23B . . . ......
#> CDK11B . . . ......
#> SLC35E2B . . . ......
#> CDK11A . 1.015884 . ......
#> SLC35E2 . . . ......
#> NADK . . . ......
#> GNB1 . . . ......
#> RP1-140A9.1 . . . ......
#> TMEM52 . . . ......
#> PRKCZ . . . ......
#> RP5-892K4.1 . . . ......
#> C1orf86 . 1.838231 . ......
#> AL590822.2 . . . ......
#> SKI . . . ......
#> RER1 . . 1.355669 ......
#> PEX10 . . . ......
#> PLCH2 . . . ......
#> PANK4 . . . ......
#> RP3-395M20.12 . . . ......
#> TNFRSF14 . 1.015884 . ......
#> RP3-395M20.9 . . . ......
#> FAM213B . . . ......
#> MMEL1 . . . ......
#> TTC34 . . . ......
#> MEGF6 . . . ......
#> TPRG1L . . . ......
#> WRAP73 . . . ......
#> TP73-AS1 . . . ......
#> SMIM1 . . . ......
#> LRRC47 . . . ......
#> CEP104 . . . ......
#> DFFB . . . ......
#> C1orf174 . . . ......
#> NPHP4 . . . ......
#> KCNAB2 . . . ......
#> RPL22 2.726919 1.509310 3.051899 ......
#> RNF207 . . . ......
#> ICMT . . . ......
#> GPR153 . . . ......
#> ACOT7 . . . ......
#> RP1-202O8.3 . . . ......
#> ESPN . . . ......
#> TNFRSF25 . . . ......
#> PLEKHG5 . . . ......
#> NOL9 . . . ......
#> ZBTB48 . . . ......
#> KLHL21 . . . ......
#> PHF13 . . . ......
#> THAP3 . . . ......
#> DNAJC11 . . . ......
#> RP11-312B8.1 . . . ......
#> CAMTA1 . . 1.910832 ......
#> VAMP3 . . . ......
#> PER3 . . . ......
#> UTS2 . . . ......
#> TNFRSF9 . . . ......
#> PARK7 . 1.838231 1.355669 ......
#> SLC45A1 . . . ......
#> RERE . . . ......
#> RP5-1115A15.1 . . . ......
#> ENO1 . 1.509310 1.355669 ......
#> ENO1-AS1 . . . ......
#>
#> .....suppressing 2610 columns in show(); maybe adjust 'options(max.print= *, width = *)'
#> ..............................
We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). This analysis helps to highlight biological signal in single-cell datasets.
We will use theFindVariableFeatures() function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.
Why do you think it is important to look only at the most variable genes? Any ideas based off the pre class lecture?
pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)
# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)
# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2
Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:
Shifts the expression of each gene, so that the mean expression across cells is 0 Scales the expression of each gene, so that the variance across cells is 1 This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate The results of this are stored in pbmc[[“RNA”]]@scale.data
Finally, we are ready to run PCA. This can be achieved using the RunPCA() command.
After running PCA, we will want to determine how important each gene is in determining each PC. This might give us ideas about marker genes to use for downstream exploration. To determine each genes contribution, we want to visualize their loading scores. I touch on this in the PCA video here: https://www.youtube.com/watch?v=GtKBZGJbnAg.
pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
#> PC_ 1
#> Positive: CST3, TYROBP, LST1, AIF1, FTL
#> Negative: MALAT1, LTB, IL32, IL7R, CD2
#> PC_ 2
#> Positive: CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1
#> Negative: NKG7, PRF1, CST7, GZMB, GZMA
#> PC_ 3
#> Positive: HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1
#> Negative: PPBP, PF4, SDPR, SPARC, GNG11
#> PC_ 4
#> Positive: HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1
#> Negative: VIM, IL7R, S100A6, IL32, S100A8
#> PC_ 5
#> Positive: GZMB, NKG7, S100A8, FGFBP2, GNLY
#> Negative: LTB, IL7R, CKB, VIM, MS4A7
#Let's take a look at the loadings
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")
In particular DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.
DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)
DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)
As can be seen above, by PC10, the heatmaps are staring to show less defined clustering. This can be useful info when we decide how many PCs to carry over in future dimensionality reduction methods like UMAP and T-SNE
An Elbow plot can also be informative, and is very similar to the scree plot I showed in the video on PCA
ElbowPlot(pbmc)
Finally we can visualize how well our first two principle components segregate our data
DimPlot(pbmc, reduction = "pca")
We are starting to see some separation of data, but we can probably cluster these cells better. Tomorrow we will review PCA and look into UMAP and T-SNE, which will do a better job at clustering our data.
ElbowPlot(pbmc)