Intro_scRNAseq_QC_PCA

The goal of this lecture is to give an overview of scRNA seq, quality control, filtering, and briefly touch on PCA

Tyler Matheny
2021-11-01

intro to RBI

,

RC1 South room9101

office hours on Thursday afternoons

Welcome

from seurat pbmc example

A couple notes on programming

For some, or many of you, this might be your first time programming. Since you are taking this course, I probably don’t have to spend a ton of time convincing you of the benefits of learning how to code. But I want to briefly touch on just a few of the benefits before proceeding with the lecture material.

benefits
learning curve

Languages of Bioinformatics

When it comes to bioinformatics, there are really 3 main languages/frameworks you will consistantly work with. These are: bash, R, and python.

languages/frameworks

Why R?

In this course, we will almost exclusively be working with R. R is a great first language to learn, because it is primarily used for data cleaning, analysis, visualization, and statistics. It is also a great place to start, because many bioinformatics packages are written in R. Finally, if you have done a lot of work in Excel (as most scientists have), the fundamental concepts and functions in R, should be fairly*** straightforward to pick up.

***with some practice

Where to go when you get stuck

One misconception I had when I started programming was that programmers have the knowledge to carry out most analyses prior to writing code. Hollywood has created this archetype of the ‘whiz-kid’ or ‘misunderstood genius hacker’ who write code at lightning speed and who just know what code needs to be written and exactly how to write it. This is not how it works in the real world. Most programmers I know spend far more time googling and thinking than writing code.

You are going to get stuck a lot.

You are going to get a lot of error messages.

You are going to make mistakes.

The key is to stick with it and to know where to go when you have problems

  1. Stack overflow

  2. RBI

  3. Online tutorials

  4. Biostars and SeqAnswers

  5. Papers

… Now that we’ve covered an overview of programming tips, let’s switch gears and switch gears to the reason you are all here

Single-cell Sequencing

RNA seq is similar to a fruit smoothie. The individual fruits (strawberries, bananas, apples, oranges) that would make up the smoothie would be analogous to the different cell types in a population. You blend up the fruits into the smoothie (lyse cells, extract RNA, and make libraries), and then you drink it (run it on a sequencer).

But when you drink a smoothie, its harder to tell which flavors came from which fruits than if you were to eat each fruit individually. Especially if you had something unexpected, like kale***, in the smoothie. This is analogous to bulk RNA-sequencing. In bulk RNA seq, you lose all information about the cell of origin for each read. If you wanted to know the RNA expression in a certain cell type, you had to try to purify that cell type from the smoothie and sequence it. Thereby losing all information about other cell types in your samples.

***yuck

Enter single cell seq. Single cell sequencing technologies allow a researcher to tag individual cells with a unique sequence, called a barcode, that allows you to know which cell a read originates from. Therefore, you can computationally parse which reads came.

Commonly used paradigms for single cell preps.

Droplet based

Through microfluics wizardy, cell gets loaded into a droplet with a bead with all the DNA bits needed to do the transcript barcoding

TECHNOLOGICAL AWESOMENESS

What’s in the drop?

In theory, each drop will have (1) a cell and (2) a DNA bit that will do the barcoding. This DNA bit contains a unique barcode, a UMI (for correcting PCR amplification), and an oligodT stretch (because mRNA).

T3 prime 10x chemistry

Combinatorial Indexing

Through some clever (simple) math: (1) a small number of cells are loaded into each well a multi-well plate and barcoded. Each well has a different barcode (2) Then they are pooled back together (3) The process is repeated with a plate containing another set of barcodes. The probability that 2 cells pick up the exact same combintation of barcodes becomes vanishingly small.

MATHEMATICAL AWESOMENESS

sci-fi seq

Combines the two methods and overloads the droplets

AWESOMENESS**2

Tradeoffs between bulk RNA seq vs. single-cell seq

“There are no solutions, only trade-offs”. And in the case of bulk RNA seq vs single-cell RNA seq, this is certainly true… for now. The information you gain about the cellular origin read from comes at a price:

  1. only a small portion of the mRNAs from each cell are captured

  2. low number of detectable genes, might not detect low expressing genes at all

  3. little info outside of gene counts

Think about the question you are trying to answer before running your experiment

  1. Almost all sequencing experiments fail or succeed at the design step.

  2. Programming cannot fix bad data. Crap in crap out.

alignment pipelines

cellranger performs alignment, filtering, UMI counting, clustering, and gene expression analysis.

Alternatives such as Alevin, STARsolo, Kallisto

# in terminal
cellranger count --id=123 \ 
                 --transcriptome=/refdata-cellranger-GRCh38-3.0.0 \ 
                 # build new transcriptome if you have GFP/RFP/transgene
                 --fastqs=/home/runs/HAT7ADXX/outs/fastq_path \ 
                 # a list of fastqs, a folder, or pass table via csv file
                 --sample=mysample 
                 # will need additional arguments for feature barcoding

output folder structure

KO_1_CDNA
|-- KO_1_CDNA.mri.tgz
|-- SC_RNA_COUNTER_CS
|   |-- CLOUPE_PREPROCESS
|   |-- EXPAND_SAMPLE_DEF
|   |-- SC_RNA_COUNTER
|   `-- fork0
|-- _cmdline
|-- _filelist
|-- _finalstate
|-- _invocation
|-- _jobmode
|-- _log
|-- _mrosource
|-- _perf
|-- _sitecheck
|-- _tags
|-- _timestamp
|-- _uuid
|-- _vdrkill
|-- _versions
`-- outs
    |-- analysis
    |-- cloupe.cloupe # cloupe file for browser view
    |-- filtered_feature_bc_matrix # use this fold for seurat
    |-- filtered_feature_bc_matrix.h5
    |-- metrics_summary.csv
    |-- molecule_info.h5
    |-- possorted_genome_bam.bam
    |-- possorted_genome_bam.bam.bai
    |-- raw_feature_bc_matrix # or this for seurat
    |-- raw_feature_bc_matrix.h5
    `-- web_summary.html # qc summary

~ 5-8 hours per sample on biochem department cluster “Bodhi”

theoretically can be ran locally on linux, but will require at least 32GB of RAM

other campus options include Rosalind, AWS https://github.com/rnabioco/cellrangerAWS

cellranger output html and loupe files

reading and making UMI-barcode elbow plots

model of final 10x library

The total UMI (unique molecular identifier - represent each transcript) of a cell barcode is used to rank the barcodes determine the UMI threshold for signal vs noise. A plot is generated for cellranger html output, but will also be useful in other situations like hashing and CITE-seq.

example of good and bad data

for a standard 10x scRNAseq run:

        expected range of x axis (barcode) : ~ 10^5 (if using ggplot to visualize, filter)

        expected inflection point of x axis (cell number): ~ # of loaded cells / 2

        expected range of y axis (UMI_counts) : ~ 10^4

        expected inflection point of y axis (cutoff UMI count): ~ 1000

library(tidyverse)
# use "raw" instead of "filtered" cellranger output folder
data_url = "https://scrnaseq-workshop.s3-us-west-2.amazonaws.com"
m1 <- readRDS(url(file.path(data_url, "raw_matrix.rds")))
# all genes x all barcodes
counts <- Matrix::colSums(m1) # calculate total UMI read number for each cell barcode
countdf <- as.data.frame(counts) %>% 
  as_tibble(rownames = "barcode") %>% 
  filter(counts >= 2) %>% # throw out cell barcodes with 1 or less UMI, this is mainly for time purposes
  arrange(desc(counts)) %>% # arrange by descending order
  mutate(rank = 1:n()) # rank
head(countdf) # barcodes now ranked by UMI counts
#> # A tibble: 6 × 3
#>   barcode          counts  rank
#>   <chr>             <dbl> <int>
#> 1 GCAGCCACATACCGTA  40745     1
#> 2 ATGGAGGGTGGTAACG  20916     2
#> 3 GTAACCATCGCTTGAA  19958     3
#> 4 TTCATGTGTCGTGTTA  19765     4
#> 5 GAATAGACATCCTGTC  18494     5
#> 6 GTCTCACGTTGGCCGT  17828     6
ggplot(countdf, aes(x = rank, y = counts)) +
  geom_point() +
  labs(x = "barcodes", y = "UMI_counts") +
  theme_classic() +
  scale_x_log10() + 
  scale_y_log10()

Creating a Seurat object

Single-cell RNA-seq counts are usually stored as a sparse matrix due to the high percentage of zeros. In a sparse matrix zeros are removed and only non-zero values are stored, which saves memory and speeds up operations.

The Read10X function can be used with the output directory generated by 10X Cell Ranger to load the counts data as a sparse matrix.

# single cell analysis suite
library(Seurat)
# data frame manipulation and %>% functions
library(tidyverse)
# Import matrix of counts
pbmc.data <- Read10X(data.dir = "filtered_gene_bc_matrices/hg19/")

pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
 
#min.cells- Include features detected in at least this many cells. Will subset the counts matrix as well. To reintroduce excluded features, create a new object with a lower cutoff.


#min.features- Include cells where at least this many features are detected.

head(pbmc@meta.data, 5)
#>                  orig.ident nCount_RNA nFeature_RNA
#> AAACATACAACCAC-1     pbmc3k       2419          779
#> AAACATTGAGCTAC-1     pbmc3k       4903         1352
#> AAACATTGATCAGC-1     pbmc3k       3147         1129
#> AAACCGTGCTTCCG-1     pbmc3k       2639          960
#> AAACCGTGTATGCG-1     pbmc3k        980          521
# lets look at come of the count data.  It is called a sparse matrix because zeros are represented as '.' saves memory
pbmc.data[c("CD3D", "TCL1A", "MS4A1"), 1:30]
#> 3 x 30 sparse Matrix of class "dgCMatrix"
#>                                                                    
#> CD3D  4 . 10 . . 1 2 3 1 . . 2 7 1 . . 1 3 . 2  3 . . . . . 3 4 1 5
#> TCL1A . .  . . . . . . 1 . . . . . . . . . . .  . 1 . . . . . . . .
#> MS4A1 . 6  . . . . . . 1 1 1 . . . . . . . . . 36 1 2 . . 2 . . . .

Digging into the seurat object.

At the top level, the Seurat object serves as a collection of Assay and DimReduc objects, representing expression data and dimensionality reductions of the expression data, respectively. The Assay objects are designed to hold expression data of a single type, such as RNA-seq gene expression, CITE-seq ADTs, cell hashtags, or imputed gene values. DimReduc objects represent transformations of the data contained within the Assay object(s) via various dimensional reduction techniques such as PCA.

See : https://github.com/satijalab/seurat/wiki/Seurat#slots for a more detailed explanation of the Seurat object

pbmc
#> An object of class Seurat 
#> 13714 features across 2700 samples within 1 assay 
#> Active assay: RNA (13714 features, 0 variable features)
# dim provides both nrow and ncol at the same time
dim(x = pbmc)
#> [1] 13714  2700
# Import matrix of counts
head(x=rownames(x=pbmc))
#> [1] "AL627309.1"    "AP006222.2"    "RP11-206L10.2" "RP11-206L10.9"
#> [5] "LINC00115"     "NOC2L"
# Import matrix of counts
head(x = colnames(x = pbmc))
#> [1] "AAACATACAACCAC-1" "AAACATTGAGCTAC-1" "AAACATTGATCAGC-1"
#> [4] "AAACCGTGCTTCCG-1" "AAACCGTGTATGCG-1" "AAACGCACTGGTAC-1"
Pulling specific Assay, DimReduc, or Graph objects can be done with the double [[ extract operator. Adding new objects to a Seurat object is also done with the double [[ extract operator; Seurat will figure out where in the Seurat object a new associated object belongs.
# A vector of names of associated objects can be had with the names function.

names(x = pbmc)
#> [1] "RNA"
# IThese can be passed to the double [[ extract operator to pull them from the Seurat object
pbmc[['RNA']]
#> Assay data with 13714 features for 2700 cells
#> First 10 features:
#>  AL627309.1, AP006222.2, RP11-206L10.2, RP11-206L10.9,
#> LINC00115, NOC2L, KLHL17, PLEKHN1, RP11-54O7.17, HES4
# Looking at the metadata.  We will add another column to this later when we look at the percentage of reads per cell mapping to mitochondrial transcripts
head(pbmc@meta.data)
#>                  orig.ident nCount_RNA nFeature_RNA
#> AAACATACAACCAC-1     pbmc3k       2419          779
#> AAACATTGAGCTAC-1     pbmc3k       4903         1352
#> AAACATTGATCAGC-1     pbmc3k       3147         1129
#> AAACCGTGCTTCCG-1     pbmc3k       2639          960
#> AAACCGTGTATGCG-1     pbmc3k        980          521
#> AAACGCACTGGTAC-1     pbmc3k       2163          781

QC and selecting cells for further analysis

Seurat allows you to easily explore QC metrics and filter cells based on any user-defined criteria. A few QC metrics commonly used by the community include

The number of unique genes detected in each cell.

Low-quality cells or empty droplets will often have very few genes.

Cell doublets or multiplets may exhibit an aberrantly high gene count.

Similarly, the total number of molecules detected within a cell (correlates strongly with unique genes)

Filtering Mito genes

The percentage of reads that map to the mitochondrial genome is indicative of dying cells.

Low-quality / dying cells often exhibit extensive mitochondrial contamination.

We calculate mitochondrial QC metrics with the PercentageFeatureSet() function, which calculates the percentage of counts originating from a set of features We use the set of all genes starting with MT- as a set of mitochondrial genes

#We will add a column to the metadata calculating the percentage of genes mapping to mitochondrial transcripts
pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")
#We can now see that the metadata now includes the percentage of mitochondrial genes
head(pbmc@meta.data, 5)
#>                  orig.ident nCount_RNA nFeature_RNA percent.mt
#> AAACATACAACCAC-1     pbmc3k       2419          779  3.0177759
#> AAACATTGAGCTAC-1     pbmc3k       4903         1352  3.7935958
#> AAACATTGATCAGC-1     pbmc3k       3147         1129  0.8897363
#> AAACCGTGCTTCCG-1     pbmc3k       2639          960  1.7430845
#> AAACCGTGTATGCG-1     pbmc3k        980          521  1.2244898
#We can also peak into the metadata by examining just a particular column.  Similar to dplyr functions.  Try erasing 'percent mito' and tab completing
head(x = pbmc$percent.mt)
#> AAACATACAACCAC-1 AAACATTGAGCTAC-1 AAACATTGATCAGC-1 AAACCGTGCTTCCG-1 
#>        3.0177759        3.7935958        0.8897363        1.7430845 
#> AAACCGTGTATGCG-1 AAACGCACTGGTAC-1 
#>        1.2244898        1.6643551
# Visualize QC metrics as a violin plot
VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

Subsetting

We want to remove low quality cells for our downstream analyses. Using the plot above, we can draw some cutoffs by eye. Here we will filter out cells by getting rid of cells that have unique feature counts over 2,500 or less than 200. We also filter cells that have >5% mitochondrial counts

pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
#Note that some cells were filtered out- We started with 2700
dim(x = pbmc)
#> [1] 13714  2638

Normaliztion

After removing unwanted cells from the dataset, the next step is to normalize the data. By default, we employ a global-scaling normalization method “LogNormalize” that normalizes the feature expression measurements for each cell by the total expression, multiplies this by a scale factor (10,000 by default), and log-transforms the result. Normalized values are stored in pbmc[[“RNA”]]@data.

Think of this step as normalizing expression of each transcripts to the read depth for all genes in that cell. The log transofrmation will compress lowly and highly expressed genes to a more similar scale

pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
head(pbmc[["RNA"]]@data,100)
#> 100 x 2638 sparse Matrix of class "dgCMatrix"
#>                                                                      
#> AL627309.1    .        .        .        .        . .        .       
#> AP006222.2    .        .        .        .        . .        .       
#> RP11-206L10.2 .        .        .        .        . .        .       
#> RP11-206L10.9 .        .        .        .        . .        .       
#> LINC00115     .        .        .        .        . .        .       
#> NOC2L         .        .        .        .        . .        .       
#> KLHL17        .        .        .        .        . .        .       
#> PLEKHN1       .        .        .        .        . .        .       
#> RP11-54O7.17  .        .        .        .        . .        .       
#> HES4          .        .        .        .        . .        .       
#> RP11-54O7.11  .        .        .        .        . .        .       
#> ISG15         .        .        1.429744 3.558310 . 1.726902 .       
#> AGRN          .        .        .        .        . .        .       
#> C1orf159      .        .        .        .        . .        .       
#> TNFRSF18      .        1.625141 .        .        . .        .       
#> TNFRSF4       .        .        .        .        . .        .       
#> SDF4          .        .        1.429744 .        . .        .       
#> B3GALT6       .        .        .        .        . .        1.722356
#> FAM132A       .        .        .        .        . .        .       
#> UBE2J2        .        .        .        .        . .        .       
#> ACAP3         .        .        .        .        . .        .       
#> PUSL1         .        .        .        .        . .        1.722356
#> CPSF3L        .        .        .        .        . .        .       
#> GLTPD1        .        .        .        .        . .        .       
#> DVL1          .        .        .        .        . .        .       
#> MXRA8         .        .        .        .        . .        .       
#> AURKAIP1      .        1.111715 .        1.566387 . .        .       
#> CCNL2         .        .        .        .        . .        .       
#> RP4-758J18.2  .        .        .        .        . .        .       
#> MRPL20        1.635873 .        1.429744 .        . .        .       
#> ATAD3C        .        .        .        .        . .        .       
#> ATAD3B        .        .        .        .        . .        .       
#> ATAD3A        .        .        .        .        . .        .       
#> SSU72         .        1.111715 .        2.515108 . .        .       
#> AL645728.1    .        .        .        .        . .        .       
#> C1orf233      .        .        .        .        . .        .       
#> RP11-345P4.9  .        .        .        .        . .        .       
#> MIB2          .        1.625141 .        .        . .        1.722356
#> MMP23B        .        .        .        .        . .        .       
#> CDK11B        .        .        .        .        . .        .       
#> SLC35E2B      .        .        .        .        . .        .       
#> CDK11A        .        .        .        .        . .        .       
#> SLC35E2       .        .        .        .        . .        .       
#> NADK          .        .        .        1.566387 . .        .       
#> GNB1          .        .        .        .        . .        .       
#> RP1-140A9.1   .        .        .        .        . .        .       
#> TMEM52        .        .        .        .        . .        .       
#> PRKCZ         .        .        .        .        . .        .       
#> RP5-892K4.1   .        .        .        .        . .        .       
#> C1orf86       .        .        .        .        . .        .       
#> AL590822.2    .        .        .        .        . .        .       
#> SKI           .        .        .        .        . .        .       
#> RER1          .        1.111715 1.429744 1.566387 . .        .       
#> PEX10         .        .        .        .        . .        .       
#> PLCH2         .        .        .        .        . .        .       
#> PANK4         .        .        1.429744 .        . 1.726902 .       
#> RP3-395M20.12 .        1.111715 .        .        . .        .       
#> TNFRSF14      .        .        .        1.566387 . 1.726902 .       
#> RP3-395M20.9  .        .        .        .        . .        .       
#> FAM213B       .        .        .        .        . .        .       
#> MMEL1         .        .        .        .        . .        .       
#> TTC34         .        .        .        .        . .        .       
#> MEGF6         .        .        .        .        . .        .       
#> TPRG1L        .        .        .        .        . .        .       
#> WRAP73        .        .        .        .        . .        .       
#> TP73-AS1      .        .        .        .        . .        .       
#> SMIM1         .        .        .        .        . .        .       
#> LRRC47        .        .        1.429744 .        . .        .       
#> CEP104        .        .        .        .        . .        .       
#> DFFB          .        .        .        .        . .        .       
#> C1orf174      .        .        .        .        . .        .       
#> NPHP4         .        .        .        .        . .        .       
#> KCNAB2        .        .        .        .        . .        .       
#> RPL22         1.635873 1.962726 1.429744 2.149274 . .        .       
#> RNF207        .        .        .        .        . .        .       
#> ICMT          .        .        .        .        . .        .       
#> GPR153        .        .        .        .        . .        .       
#> ACOT7         .        .        .        .        . .        1.722356
#> RP1-202O8.3   .        .        .        .        . .        .       
#> ESPN          .        .        .        .        . .        .       
#> TNFRSF25      2.226555 .        .        .        . .        .       
#> PLEKHG5       .        .        .        .        . .        .       
#> NOL9          .        .        .        .        . .        .       
#> ZBTB48        .        .        .        .        . .        .       
#> KLHL21        .        .        .        .        . .        .       
#> PHF13         .        .        .        .        . .        .       
#> THAP3         .        .        .        .        . .        .       
#> DNAJC11       .        .        .        .        . .        .       
#> RP11-312B8.1  .        .        .        .        . .        .       
#> CAMTA1        .        1.111715 .        .        . .        .       
#> VAMP3         .        .        .        .        . .        .       
#> PER3          .        .        .        .        . .        .       
#> UTS2          .        .        .        .        . .        .       
#> TNFRSF9       .        .        .        .        . .        .       
#> PARK7         .        .        1.995416 .        . .        .       
#> SLC45A1       .        .        .        .        . .        .       
#> RERE          .        .        .        .        . .        1.722356
#> RP5-1115A15.1 .        .        .        .        . .        .       
#> ENO1          .        1.625141 1.995416 2.149274 . .        1.722356
#> ENO1-AS1      .        .        .        .        . .        .       
#>                                                                    
#> AL627309.1    .        .        .        .        .        .       
#> AP006222.2    .        .        .        .        .        .       
#> RP11-206L10.2 .        .        .        .        .        .       
#> RP11-206L10.9 .        .        .        .        .        .       
#> LINC00115     .        .        .        .        .        .       
#> NOC2L         .        .        .        .        1.646272 .       
#> KLHL17        .        .        .        .        .        .       
#> PLEKHN1       .        .        .        .        .        .       
#> RP11-54O7.17  .        .        .        .        .        .       
#> HES4          .        .        .        .        .        .       
#> RP11-54O7.11  .        .        .        .        .        .       
#> ISG15         .        .        3.339271 .        .        1.638876
#> AGRN          .        .        .        .        .        .       
#> C1orf159      .        .        .        .        .        .       
#> TNFRSF18      .        .        .        .        .        .       
#> TNFRSF4       .        2.179642 .        .        .        .       
#> SDF4          .        .        .        .        .        .       
#> B3GALT6       .        .        .        .        .        .       
#> FAM132A       .        .        .        .        .        .       
#> UBE2J2        .        2.179642 .        1.268336 .        .       
#> ACAP3         .        .        .        .        .        .       
#> PUSL1         .        .        .        .        .        .       
#> CPSF3L        .        .        .        1.268336 .        .       
#> GLTPD1        .        .        .        .        .        .       
#> DVL1          .        .        .        .        .        .       
#> MXRA8         .        .        .        .        .        .       
#> AURKAIP1      1.690977 .        .        1.809904 .        2.229881
#> CCNL2         .        .        .        .        .        .       
#> RP4-758J18.2  .        .        .        1.268336 .        .       
#> MRPL20        1.690977 .        .        1.809904 .        .       
#> ATAD3C        .        .        .        .        .        .       
#> ATAD3B        .        .        .        .        1.646272 .       
#> ATAD3A        .        .        .        .        .        .       
#> SSU72         .        .        .        .        .        .       
#> AL645728.1    .        .        .        .        .        .       
#> C1orf233      .        .        .        .        .        .       
#> RP11-345P4.9  .        .        .        .        .        .       
#> MIB2          .        .        .        .        .        .       
#> MMP23B        .        .        .        .        .        .       
#> CDK11B        .        .        .        .        .        .       
#> SLC35E2B      .        .        .        .        .        .       
#> CDK11A        .        .        .        .        .        .       
#> SLC35E2       .        .        .        .        .        .       
#> NADK          .        .        .        .        .        .       
#> GNB1          .        2.179642 .        .        .        .       
#> RP1-140A9.1   .        .        .        .        .        .       
#> TMEM52        .        .        .        .        .        .       
#> PRKCZ         .        .        .        .        .        .       
#> RP5-892K4.1   .        .        .        .        .        .       
#> C1orf86       .        2.179642 .        .        .        .       
#> AL590822.2    .        .        .        .        .        .       
#> SKI           .        .        .        .        .        .       
#> RER1          .        .        .        .        1.646272 .       
#> PEX10         .        .        .        .        .        .       
#> PLCH2         .        .        .        .        .        .       
#> PANK4         .        .        .        .        .        .       
#> RP3-395M20.12 .        .        .        .        .        .       
#> TNFRSF14      .        .        .        1.268336 .        .       
#> RP3-395M20.9  .        .        .        .        .        .       
#> FAM213B       .        .        .        .        .        .       
#> MMEL1         .        .        .        .        .        .       
#> TTC34         .        .        .        .        .        .       
#> MEGF6         .        .        .        .        .        .       
#> TPRG1L        .        .        .        .        .        .       
#> WRAP73        .        .        .        .        .        .       
#> TP73-AS1      .        .        .        .        1.646272 .       
#> SMIM1         .        .        .        .        .        .       
#> LRRC47        .        .        .        .        .        .       
#> CEP104        .        .        .        .        .        .       
#> DFFB          .        .        .        .        .        .       
#> C1orf174      .        2.179642 .        .        .        .       
#> NPHP4         .        .        .        .        .        .       
#> KCNAB2        .        .        .        .        .        .       
#> RPL22         1.690977 .        .        2.159268 2.238069 2.229881
#> RNF207        .        .        .        .        .        .       
#> ICMT          .        .        .        .        .        .       
#> GPR153        .        .        .        .        .        .       
#> ACOT7         .        .        .        .        .        .       
#> RP1-202O8.3   .        .        .        .        .        .       
#> ESPN          .        .        .        .        .        .       
#> TNFRSF25      1.690977 .        .        .        .        .       
#> PLEKHG5       .        .        .        .        .        .       
#> NOL9          .        .        .        .        .        .       
#> ZBTB48        .        .        .        .        .        .       
#> KLHL21        .        .        .        .        .        .       
#> PHF13         .        .        .        .        .        .       
#> THAP3         1.690977 .        .        .        .        .       
#> DNAJC11       .        .        .        .        .        .       
#> RP11-312B8.1  .        .        .        .        .        .       
#> CAMTA1        .        .        .        .        .        .       
#> VAMP3         .        .        .        .        .        .       
#> PER3          .        .        .        .        .        .       
#> UTS2          .        .        .        .        .        .       
#> TNFRSF9       .        .        .        .        .        .       
#> PARK7         .        .        2.309182 1.268336 .        2.229881
#> SLC45A1       .        .        .        .        .        .       
#> RERE          .        .        .        .        .        .       
#> RP5-1115A15.1 .        .        .        .        .        .       
#> ENO1          .        .        .        1.268336 .        .       
#> ENO1-AS1      .        .        .        .        .        .       
#>                                                                    
#> AL627309.1    .        .        .        .        .        .       
#> AP006222.2    .        .        .        .        .        .       
#> RP11-206L10.2 .        .        .        .        .        .       
#> RP11-206L10.9 .        .        .        .        .        .       
#> LINC00115     .        .        .        .        .        .       
#> NOC2L         .        .        .        .        .        .       
#> KLHL17        .        .        .        .        .        .       
#> PLEKHN1       .        .        .        .        .        .       
#> RP11-54O7.17  .        .        .        .        .        .       
#> HES4          .        .        .        1.157353 .        .       
#> RP11-54O7.11  .        .        .        .        .        .       
#> ISG15         2.861362 .        3.267762 1.157353 .        .       
#> AGRN          .        .        .        .        .        .       
#> C1orf159      .        .        .        .        .        .       
#> TNFRSF18      .        .        .        .        .        .       
#> TNFRSF4       .        .        .        .        3.737279 .       
#> SDF4          .        .        .        .        1.485076 1.102225
#> B3GALT6       .        .        .        .        .        .       
#> FAM132A       .        .        .        .        .        .       
#> UBE2J2        1.457932 .        .        .        1.485076 .       
#> ACAP3         .        2.270898 .        .        .        .       
#> PUSL1         .        .        .        .        .        .       
#> CPSF3L        .        .        .        .        .        .       
#> GLTPD1        .        .        .        .        .        .       
#> DVL1          .        .        .        .        .        .       
#> MXRA8         .        .        .        .        .        .       
#> AURKAIP1      1.457932 .        .        .        .        .       
#> CCNL2         1.457932 .        .        1.679524 1.485076 .       
#> RP4-758J18.2  .        .        .        .        .        .       
#> MRPL20        1.457932 .        .        .        1.485076 .       
#> ATAD3C        .        .        .        .        .        .       
#> ATAD3B        .        .        .        .        .        1.102225
#> ATAD3A        .        .        .        .        .        .       
#> SSU72         1.457932 .        .        2.020819 .        1.613772
#> AL645728.1    .        .        .        .        .        .       
#> C1orf233      .        .        .        .        .        .       
#> RP11-345P4.9  .        .        .        .        .        .       
#> MIB2          1.457932 .        .        .        .        .       
#> MMP23B        .        .        .        .        .        .       
#> CDK11B        .        .        .        .        .        .       
#> SLC35E2B      .        .        .        .        .        .       
#> CDK11A        1.457932 .        .        .        .        .       
#> SLC35E2       .        .        .        .        .        .       
#> NADK          2.027376 .        .        .        .        .       
#> GNB1          .        .        .        1.157353 .        1.102225
#> RP1-140A9.1   .        .        .        .        .        .       
#> TMEM52        .        .        .        .        .        .       
#> PRKCZ         .        .        .        .        .        .       
#> RP5-892K4.1   .        .        .        .        .        .       
#> C1orf86       .        .        .        .        .        .       
#> AL590822.2    .        .        .        .        .        .       
#> SKI           .        .        .        .        .        .       
#> RER1          1.457932 .        .        1.679524 .        .       
#> PEX10         .        .        .        .        .        .       
#> PLCH2         .        .        .        .        .        .       
#> PANK4         .        .        .        .        .        .       
#> RP3-395M20.12 .        .        .        .        .        .       
#> TNFRSF14      .        .        .        1.157353 .        .       
#> RP3-395M20.9  .        .        .        .        .        .       
#> FAM213B       .        .        .        .        .        .       
#> MMEL1         .        .        .        .        .        .       
#> TTC34         .        .        .        .        .        .       
#> MEGF6         .        .        .        .        .        .       
#> TPRG1L        .        .        .        .        .        .       
#> WRAP73        .        .        .        .        .        .       
#> TP73-AS1      .        .        .        .        .        .       
#> SMIM1         .        .        .        .        .        .       
#> LRRC47        .        .        .        .        .        .       
#> CEP104        .        .        .        .        .        .       
#> DFFB          .        .        .        .        .        .       
#> C1orf174      .        .        .        .        .        1.102225
#> NPHP4         .        .        .        .        .        .       
#> KCNAB2        .        .        .        1.157353 .        1.102225
#> RPL22         1.457932 2.270898 3.267762 2.274803 1.485076 2.569949
#> RNF207        .        .        .        .        .        .       
#> ICMT          .        .        .        .        .        .       
#> GPR153        .        .        .        .        .        .       
#> ACOT7         .        .        .        .        .        .       
#> RP1-202O8.3   .        .        .        .        .        .       
#> ESPN          .        .        .        .        .        .       
#> TNFRSF25      .        .        .        .        .        .       
#> PLEKHG5       .        .        .        .        .        .       
#> NOL9          .        .        .        .        .        .       
#> ZBTB48        .        .        .        .        .        .       
#> KLHL21        .        .        .        .        .        .       
#> PHF13         .        .        .        .        .        .       
#> THAP3         1.457932 .        .        .        .        .       
#> DNAJC11       .        .        .        .        .        .       
#> RP11-312B8.1  .        .        .        .        .        .       
#> CAMTA1        .        .        .        .        .        .       
#> VAMP3         1.457932 .        .        .        .        .       
#> PER3          .        .        .        .        .        .       
#> UTS2          .        .        .        .        .        .       
#> TNFRSF9       .        .        .        .        .        .       
#> PARK7         .        .        .        2.020819 1.485076 1.950553
#> SLC45A1       .        .        .        .        .        .       
#> RERE          .        .        .        .        .        .       
#> RP5-1115A15.1 .        .        .        .        .        .       
#> ENO1          1.457932 .        .        2.645395 1.485076 2.202039
#> ENO1-AS1      .        .        .        .        .        .       
#>                                                                    
#> AL627309.1    .        .        .        .        .        .       
#> AP006222.2    .        .        .        .        .        .       
#> RP11-206L10.2 .        .        .        .        .        .       
#> RP11-206L10.9 .        .        .        .        .        .       
#> LINC00115     .        .        .        .        .        .       
#> NOC2L         .        1.398186 .        .        .        .       
#> KLHL17        .        .        .        .        .        .       
#> PLEKHN1       .        .        .        .        .        .       
#> RP11-54O7.17  .        .        .        .        .        .       
#> HES4          .        .        .        .        .        .       
#> RP11-54O7.11  .        .        .        .        .        .       
#> ISG15         .        1.398186 .        1.553327 1.670007 .       
#> AGRN          .        .        .        .        .        .       
#> C1orf159      .        .        .        .        .        .       
#> TNFRSF18      .        .        .        .        .        .       
#> TNFRSF4       .        .        .        .        .        .       
#> SDF4          .        .        .        .        1.670007 .       
#> B3GALT6       .        .        .        .        .        .       
#> FAM132A       .        .        .        .        .        .       
#> UBE2J2        .        .        .        .        .        .       
#> ACAP3         .        .        .        .        .        .       
#> PUSL1         .        .        .        .        .        .       
#> CPSF3L        .        .        .        .        .        .       
#> GLTPD1        .        .        .        .        .        .       
#> DVL1          .        .        .        .        .        .       
#> MXRA8         .        .        .        .        .        .       
#> AURKAIP1      .        1.398186 .        1.553327 .        .       
#> CCNL2         .        .        .        .        .        .       
#> RP4-758J18.2  .        .        .        .        .        .       
#> MRPL20        .        .        .        1.553327 .        .       
#> ATAD3C        .        .        .        .        .        .       
#> ATAD3B        .        .        .        .        .        .       
#> ATAD3A        .        .        .        .        .        .       
#> SSU72         .        .        2.309999 .        .        2.089658
#> AL645728.1    .        .        .        .        .        .       
#> C1orf233      .        .        .        .        .        .       
#> RP11-345P4.9  .        .        .        .        .        .       
#> MIB2          .        .        .        .        .        .       
#> MMP23B        .        .        .        .        .        .       
#> CDK11B        .        .        .        1.553327 .        .       
#> SLC35E2B      .        .        .        .        .        .       
#> CDK11A        .        .        .        1.553327 .        .       
#> SLC35E2       .        .        .        .        .        .       
#> NADK          .        .        .        .        .        .       
#> GNB1          .        .        .        .        1.670007 .       
#> RP1-140A9.1   .        .        .        .        .        .       
#> TMEM52        .        .        .        .        .        .       
#> PRKCZ         .        .        .        .        .        .       
#> RP5-892K4.1   .        .        .        .        .        .       
#> C1orf86       .        .        .        .        .        .       
#> AL590822.2    .        .        .        .        .        .       
#> SKI           .        .        .        .        .        .       
#> RER1          .        .        .        .        .        .       
#> PEX10         .        .        .        .        .        .       
#> PLCH2         .        .        .        .        .        .       
#> PANK4         .        .        .        .        .        .       
#> RP3-395M20.12 .        .        .        .        .        .       
#> TNFRSF14      2.184526 .        .        .        1.670007 .       
#> RP3-395M20.9  .        .        .        .        .        .       
#> FAM213B       .        .        .        .        .        .       
#> MMEL1         .        .        .        .        .        .       
#> TTC34         .        .        .        .        .        .       
#> MEGF6         .        .        .        .        .        .       
#> TPRG1L        .        .        .        .        .        .       
#> WRAP73        .        .        .        .        .        .       
#> TP73-AS1      .        .        .        .        .        .       
#> SMIM1         .        .        .        .        .        .       
#> LRRC47        .        .        .        .        .        .       
#> CEP104        .        .        .        .        .        .       
#> DFFB          .        .        .        .        .        .       
#> C1orf174      .        .        .        .        .        2.089658
#> NPHP4         .        .        .        .        .        .       
#> KCNAB2        .        .        .        .        .        2.718944
#> RPL22         3.482647 2.579565 .        1.553327 2.264302 2.089658
#> RNF207        .        .        .        .        .        .       
#> ICMT          .        .        .        .        .        .       
#> GPR153        .        .        .        .        .        .       
#> ACOT7         .        .        .        .        .        .       
#> RP1-202O8.3   .        .        .        .        .        .       
#> ESPN          .        .        .        .        .        .       
#> TNFRSF25      .        .        .        .        .        .       
#> PLEKHG5       .        .        .        .        .        .       
#> NOL9          .        .        .        .        .        .       
#> ZBTB48        .        .        .        .        .        .       
#> KLHL21        .        .        .        .        .        .       
#> PHF13         .        .        .        .        .        .       
#> THAP3         .        .        .        .        .        .       
#> DNAJC11       .        .        .        .        .        .       
#> RP11-312B8.1  .        .        .        .        .        .       
#> CAMTA1        .        .        .        .        .        4.533330
#> VAMP3         .        .        .        .        .        .       
#> PER3          .        .        .        .        .        .       
#> UTS2          .        .        .        .        .        .       
#> TNFRSF9       .        .        .        .        .        .       
#> PARK7         .        1.398186 .        .        .        .       
#> SLC45A1       .        .        .        .        .        .       
#> RERE          .        .        .        .        .        .       
#> RP5-1115A15.1 .        .        .        .        .        .       
#> ENO1          .        1.959489 .        .        1.670007 .       
#> ENO1-AS1      .        .        .        .        .        .       
#>                                                
#> AL627309.1    .        .        .        ......
#> AP006222.2    .        .        .        ......
#> RP11-206L10.2 .        .        .        ......
#> RP11-206L10.9 .        .        .        ......
#> LINC00115     .        .        .        ......
#> NOC2L         .        .        .        ......
#> KLHL17        .        .        .        ......
#> PLEKHN1       .        .        .        ......
#> RP11-54O7.17  .        1.015884 .        ......
#> HES4          .        .        .        ......
#> RP11-54O7.11  .        .        .        ......
#> ISG15         .        1.509310 .        ......
#> AGRN          .        .        .        ......
#> C1orf159      .        .        .        ......
#> TNFRSF18      .        .        .        ......
#> TNFRSF4       .        .        .        ......
#> SDF4          .        .        .        ......
#> B3GALT6       .        .        .        ......
#> FAM132A       .        .        .        ......
#> UBE2J2        .        1.015884 .        ......
#> ACAP3         .        .        .        ......
#> PUSL1         .        .        .        ......
#> CPSF3L        .        .        .        ......
#> GLTPD1        .        .        .        ......
#> DVL1          .        .        .        ......
#> MXRA8         .        .        .        ......
#> AURKAIP1      .        1.015884 .        ......
#> CCNL2         .        .        .        ......
#> RP4-758J18.2  .        .        .        ......
#> MRPL20        .        .        .        ......
#> ATAD3C        .        .        .        ......
#> ATAD3B        .        .        .        ......
#> ATAD3A        .        .        1.355669 ......
#> SSU72         .        .        .        ......
#> AL645728.1    .        .        .        ......
#> C1orf233      .        1.015884 .        ......
#> RP11-345P4.9  .        .        .        ......
#> MIB2          .        .        .        ......
#> MMP23B        .        .        .        ......
#> CDK11B        .        .        .        ......
#> SLC35E2B      .        .        .        ......
#> CDK11A        .        1.015884 .        ......
#> SLC35E2       .        .        .        ......
#> NADK          .        .        .        ......
#> GNB1          .        .        .        ......
#> RP1-140A9.1   .        .        .        ......
#> TMEM52        .        .        .        ......
#> PRKCZ         .        .        .        ......
#> RP5-892K4.1   .        .        .        ......
#> C1orf86       .        1.838231 .        ......
#> AL590822.2    .        .        .        ......
#> SKI           .        .        .        ......
#> RER1          .        .        1.355669 ......
#> PEX10         .        .        .        ......
#> PLCH2         .        .        .        ......
#> PANK4         .        .        .        ......
#> RP3-395M20.12 .        .        .        ......
#> TNFRSF14      .        1.015884 .        ......
#> RP3-395M20.9  .        .        .        ......
#> FAM213B       .        .        .        ......
#> MMEL1         .        .        .        ......
#> TTC34         .        .        .        ......
#> MEGF6         .        .        .        ......
#> TPRG1L        .        .        .        ......
#> WRAP73        .        .        .        ......
#> TP73-AS1      .        .        .        ......
#> SMIM1         .        .        .        ......
#> LRRC47        .        .        .        ......
#> CEP104        .        .        .        ......
#> DFFB          .        .        .        ......
#> C1orf174      .        .        .        ......
#> NPHP4         .        .        .        ......
#> KCNAB2        .        .        .        ......
#> RPL22         2.726919 1.509310 3.051899 ......
#> RNF207        .        .        .        ......
#> ICMT          .        .        .        ......
#> GPR153        .        .        .        ......
#> ACOT7         .        .        .        ......
#> RP1-202O8.3   .        .        .        ......
#> ESPN          .        .        .        ......
#> TNFRSF25      .        .        .        ......
#> PLEKHG5       .        .        .        ......
#> NOL9          .        .        .        ......
#> ZBTB48        .        .        .        ......
#> KLHL21        .        .        .        ......
#> PHF13         .        .        .        ......
#> THAP3         .        .        .        ......
#> DNAJC11       .        .        .        ......
#> RP11-312B8.1  .        .        .        ......
#> CAMTA1        .        .        1.910832 ......
#> VAMP3         .        .        .        ......
#> PER3          .        .        .        ......
#> UTS2          .        .        .        ......
#> TNFRSF9       .        .        .        ......
#> PARK7         .        1.838231 1.355669 ......
#> SLC45A1       .        .        .        ......
#> RERE          .        .        .        ......
#> RP5-1115A15.1 .        .        .        ......
#> ENO1          .        1.509310 1.355669 ......
#> ENO1-AS1      .        .        .        ......
#> 
#>  .....suppressing 2610 columns in show(); maybe adjust 'options(max.print= *, width = *)'
#>  ..............................

Identification of highly variable features (feature selection)

We next calculate a subset of features that exhibit high cell-to-cell variation in the dataset (i.e, they are highly expressed in some cells, and lowly expressed in others). This analysis helps to highlight biological signal in single-cell datasets.

We will use theFindVariableFeatures() function. By default, we return 2,000 features per dataset. These will be used in downstream analysis, like PCA.

Why do you think it is important to look only at the most variable genes? Any ideas based off the pre class lecture?

pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(pbmc), 10)

# plot variable features with and without labels
plot1 <- VariableFeaturePlot(pbmc)
plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
plot2

Scaling the data

Next, we apply a linear transformation (‘scaling’) that is a standard pre-processing step prior to dimensional reduction techniques like PCA. The ScaleData() function:

Shifts the expression of each gene, so that the mean expression across cells is 0 Scales the expression of each gene, so that the variance across cells is 1 This step gives equal weight in downstream analyses, so that highly-expressed genes do not dominate The results of this are stored in pbmc[[“RNA”]]@scale.data

all.genes <- rownames(pbmc)
pbmc <- ScaleData(pbmc, features = all.genes)
View(pbmc[["RNA"]]@scale.data)

PCA

Finally, we are ready to run PCA. This can be achieved using the RunPCA() command.

After running PCA, we will want to determine how important each gene is in determining each PC. This might give us ideas about marker genes to use for downstream exploration. To determine each genes contribution, we want to visualize their loading scores. I touch on this in the PCA video here: https://www.youtube.com/watch?v=GtKBZGJbnAg.

pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))
# Examine and visualize PCA results a few different ways
print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
#> PC_ 1 
#> Positive:  CST3, TYROBP, LST1, AIF1, FTL 
#> Negative:  MALAT1, LTB, IL32, IL7R, CD2 
#> PC_ 2 
#> Positive:  CD79A, MS4A1, TCL1A, HLA-DQA1, HLA-DQB1 
#> Negative:  NKG7, PRF1, CST7, GZMB, GZMA 
#> PC_ 3 
#> Positive:  HLA-DQA1, CD79A, CD79B, HLA-DQB1, HLA-DPB1 
#> Negative:  PPBP, PF4, SDPR, SPARC, GNG11 
#> PC_ 4 
#> Positive:  HLA-DQA1, CD79B, CD79A, MS4A1, HLA-DQB1 
#> Negative:  VIM, IL7R, S100A6, IL32, S100A8 
#> PC_ 5 
#> Positive:  GZMB, NKG7, S100A8, FGFBP2, GNLY 
#> Negative:  LTB, IL7R, CKB, VIM, MS4A7
#Let's take a look at the loadings
VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

In particular DimHeatmap() allows for easy exploration of the primary sources of heterogeneity in a dataset, and can be useful when trying to decide which PCs to include for further downstream analyses. Both cells and features are ordered according to their PCA scores. Setting cells to a number plots the ‘extreme’ cells on both ends of the spectrum, which dramatically speeds plotting for large datasets. Though clearly a supervised analysis, we find this to be a valuable tool for exploring correlated feature sets.

DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

DimHeatmap(pbmc, dims = 1:15, cells = 500, balanced = TRUE)

As can be seen above, by PC10, the heatmaps are staring to show less defined clustering. This can be useful info when we decide how many PCs to carry over in future dimensionality reduction methods like UMAP and T-SNE

An Elbow plot can also be informative, and is very similar to the scree plot I showed in the video on PCA

ElbowPlot(pbmc)

Finally we can visualize how well our first two principle components segregate our data

DimPlot(pbmc, reduction = "pca")

We are starting to see some separation of data, but we can probably cluster these cells better. Tomorrow we will review PCA and look into UMAP and T-SNE, which will do a better job at clustering our data.

ElbowPlot(pbmc)