Class 4: Multi-modal data

This tutorial describes a basic CITE-seq analysis workflow.

Kent Riemondy https://github.com/kriemo (RNA Bioscience Initiative)https://medschool.cuanschutz.edu/rbi , Ryan Sheridan https://github.com/sheridar (RNA Bioscience Initiative)https://medschool.cuanschutz.edu/rbi
10-13-2021

Download the Rmarkdown here

Experimental design

CITE-seq enables detection of cell surface proteins and gene expression

CITE-seq reagents

Biolegend is the main company selling CITE-seq and cell hashing antibodies (TotalSeq). Biolegend reagents are divided into three product lines:

Cell hashing reagents

Cell hashing allows for sample multiplexing and “super-loaded” runs with >10,000 captured cells. Super-loading results in higher doublet rates (~10% for 10,000 captured cells), but these doublets can be removed by identifying cell barcodes that are associated with multiple hashtag oligos.

Biolegend cell hashing reagents for human cells include a mix of two antibodies that recognize CD298 and β2 microglobulin. Mouse hashing antibodies recognize CD45 and H-2 MHC class I.

TotalSeq-A reagents use a different PCR handle for CITE-seq and cell hashing antibodies. This means two separate libraries have to be prepared. To ensure that the correct libraries are created, it is import to tell the sequencing core which types of antibodies were included in the experiment.

TotalSeq-C reagents use the same PCR handle for CITE-seq and cell hashing antibodies, which means that only a single library will be prepared. However, to ensure that the correct libraries are created the core should be notified of all reagents used for the experiment.

MULTI-seq uses lipid- and cholesterol-modified oligonucleotides.

CellPlex was developed by 10x Genomics and is similar to MULTI-seq.


Creating a Seurat object with multiple assays

Loading counts matrices

The Read10X function can be used with the output directory generated by Cell Ranger. However, public datasets can be formatted in many different ways, so it is very useful to become familar with converting between formats. Our UMI count data is stored as comma-separated files, which we can load as data.frames and then convert to sparse matrices.

Our data today will be a collection of 4 PBMC samples that have each been “hashed” with different cell-hashtags, stained with 8 different CITE-seq antibodies, then combined and captured on the 10x platform in 1 sample.

We have three files that we will work with:

# Data URL
data_url <- "https://scrnaseq-workshop.s3-us-west-2.amazonaws.com"

# Function to import counts
import_counts <- function(file_name, file_url = data_url) {
  mtx <- file.path(file_url, file_name) %>%
    read_csv() %>%
    column_to_rownames(colnames(.[, 1])) %>%
    as.sparse()

  mtx
}

# Import gene expression matrix
rna_mtx <- import_counts("CITEseq_cDNA.csv.gz")

# Import CITE-seq matrix
adt_mtx <- import_counts("CITEseq_ADT.csv.gz")

rownames(adt_mtx) <- str_c("adt-", rownames(adt_mtx))
adt_mtx[, 1:10]
#> 8 x 10 sparse Matrix of class "dgCMatrix"
#>                                               
#> adt-CD14   11   2  14   1   .   1   1  . .   1
#> adt-CD19    9   . 165 295   .   3   6  4 2   1
#> adt-CD3     .  19  79   2   3   6   9  3 3   4
#> adt-CD4    69 360 293   7 159 321   1  7 7   4
#> adt-CD45    .   .   3   4   .   1   .  . 1   .
#> adt-CD45RA 25   3 246  16   8   5  18  6 4   9
#> adt-CD45RO  6   4 108   6   .   2   1  2 1   3
#> adt-CD8     7   .  36   6   .   . 473 21 7 197
# Import HTO matrix
hto_mtx <- import_counts("CITEseq_HTO.csv.gz")

hto_mtx[, 1:10]
#> 4 x 10 sparse Matrix of class "dgCMatrix"
#>                                             
#> HTO28 351   .   .  .   2 161   .   .   .   .
#> HTO29   6   2   3  1   1   2   2 107   2   .
#> HTO30   . 131 177 60   .   .   .   . 239 155
#> HTO44   1   .   .  . 122   . 172   .   1   .

Creating a Seurat object

When adding multiple assays to a Seurat object, we first must identify cell barcodes that are present in all of the datasets. If one of the assays has a different number of cell barcodes Seurat will throw an error.

# Get list of common cell barcodes
rna_bcs <- colnames(rna_mtx)
adt_bcs <- colnames(adt_mtx)
hto_bcs <- colnames(hto_mtx)

merged_bcs <- rna_bcs %>%
  intersect(adt_bcs) %>%
  intersect(hto_bcs)

# Create Seurat object
sobj <- CreateSeuratObject(
  counts    = rna_mtx[, merged_bcs],
  min.cells = 5
)

# Add CITE-seq and cell hashing data to Seurat object
sobj[["ADT"]] <- CreateAssayObject(adt_mtx[, merged_bcs])

sobj[["HTO"]] <- CreateAssayObject(hto_mtx[, merged_bcs])
  
sobj
#> An object of class Seurat 
#> 15032 features across 4292 samples within 3 assays 
#> Active assay: RNA (15020 features, 0 variable features)
#>  2 other assays present: ADT, HTO


Demultiplexing hashed samples

We will first use the cell hashing data to assign cells to their sample of origin. This is referred to as demultiplexing as we are using the cell hashing antibody barcodes to assign each cell to a sample.

Normalizing HTO counts

To account for differences in antibody binding efficiency, CITE-seq and cell hashing data can be normalized by performing a centered log-ratio transformation for each individual antibody.

Note that the best approach for normalizing CITE-seq data has not been settled. For a more detailed discussion, and alternative approaches, see the Normalization chapter from the Orchestrating Single Cell Analysis eBook from Bioconductor.

# Normalize HTO counts
sobj <- sobj %>%
  NormalizeData(
    assay = "HTO",
    normalization.method = "CLR"
  )

Sample demultiplexing and identification of doublets

To demultiplex hashed samples, the HTODemux function uses the normalized HTO counts for k-medoids clustering. This results in a cluster for each HTO. A background signal is then calculated for each HTO using cells that are not present in the HTO-specific cluster. Outlier cells from this background signal are then classified as being “positive” for the HTO if they are above a cutoff quantile.

Cells that are positive for multiple HTOs are classified as doublets and cells that are not positive for any HTO are classified as “negative” cells.

The HTODemux function automatically adds several columns to the object meta.data:

# Demultiplex samples
# By default HTODemux will look for the "HTO" assay
sobj <- sobj %>%
  HTODemux(positive.quantile = 0.97)

head(sobj@meta.data, 2)
#>                     orig.ident nCount_RNA nFeature_RNA nCount_ADT
#> AAACCTGAGACAAAGG SeuratProject       3054         1217        388
#> AAACCTGAGCTACCGC SeuratProject       2411          989        337
#>                  nFeature_ADT nCount_HTO nFeature_HTO HTO_maxID
#> AAACCTGAGACAAAGG            5        358            3     HTO28
#> AAACCTGAGCTACCGC            8        180            2     HTO30
#>                  HTO_secondID HTO_margin HTO_classification
#> AAACCTGAGACAAAGG        HTO29   3.211027              HTO28
#> AAACCTGAGCTACCGC        HTO29   3.448779              HTO30
#>                  HTO_classification.global hash.ID
#> AAACCTGAGACAAAGG                   Singlet   HTO28
#> AAACCTGAGCTACCGC                   Singlet   HTO30
# Summarize cell classifications
table(sobj$HTO_classification.global)
#> 
#>  Doublet Negative  Singlet 
#>      386        1     3905
# Create ridge plots showing HTO signal
sobj %>%
  RidgePlot(
    assay    = "HTO",
    features = rownames(hto_mtx),
    ncol     = 2
  )

Compare the number of cells with each hash.ID

# Calculate doublet rate
sobj@meta.data %>%
  group_by(HTO_classification.global) %>% 
  summarize(
    n     = n(),
    fract = n / nrow(.)
  )
#> # A tibble: 3 x 3
#>   HTO_classification.global     n    fract
#> * <chr>                     <int>    <dbl>
#> 1 Doublet                     386 0.0899  
#> 2 Negative                      1 0.000233
#> 3 Singlet                    3905 0.910
# Create bar graphs comparing cell count for each sample
dat <- sobj@meta.data %>%
  rownames_to_column("cell_id")

bars_1 <- dat %>%
  ggplot(aes(hash.ID, fill = hash.ID)) +
  geom_bar() +
  labs(y = "cell count") +
  theme_minimal() +
  theme(axis.title.x = element_blank())

bars_2 <- dat %>%
  ggplot(aes("PBMC", fill = hash.ID)) +
  geom_bar() +
  labs(y = "cell count") +
  theme_minimal() +
  theme(axis.title.x = element_blank())

# Combine plots
plot_grid(
  bars_1, bars_2,
  align       = "h",
  nrow        = 1,
  rel_widths  = c(1, 0.6)
)


Filtering data and assessing quality

Now that we have assigned each cell to the appropriate sample we can continue processing the gene expression data.

Assessing data quality

# Add mitochondrial percentage to meta.data table
sobj <- sobj %>%
  PercentageFeatureSet(
    assay    = "RNA",
    pattern  = "^MT-", 
    col.name = "percent_mito"
  )

# Create violin plots for gene expression data
sobj %>%
  VlnPlot(
    features = c("nCount_RNA", "nFeature_RNA", "percent_mito"), 
    ncol     = 3,
    pt.size  = 0
  )

# Aim to sequence CITE-seq libraries at 2k-5k reads/cell, cell hashing 1k-2k reads/cell
sobj %>%
  VlnPlot(
    features = c("nCount_ADT", "nCount_HTO"),
    ncol     = 2,
    pt.size  = 0,
    log      = TRUE
  )

# Filter cells based on HTO class, number of genes, and percent mito UMIs
filt_so <-  sobj %>%
  subset(
    nFeature_RNA > 250 &                    # Remove cells with < 250 detected genes
    nFeature_RNA < 2500 &                   # Remove cells with > 2500 detected genes (could be doublets)
    percent_mito < 15 &                     # Remove cells with > 0.15 mito/total reads
    HTO_classification.global == "Singlet"
  )

filt_so
#> An object of class Seurat 
#> 15032 features across 3686 samples within 3 assays 
#> Active assay: RNA (15020 features, 0 variable features)
#>  2 other assays present: ADT, HTO
# Rename cell identities with sample names
filt_so <- filt_so %>%
  RenameIdents(
    "HTO28" = "PBMC-1",
    "HTO29" = "PBMC-2",
    "HTO30" = "PBMC-3",
    "HTO44" = "PBMC-4"
  )

# Add sample names to meta.data table
filt_so$sample <- Idents(filt_so)

head(filt_so@meta.data, 2)
#>                     orig.ident nCount_RNA nFeature_RNA nCount_ADT
#> AAACCTGAGACAAAGG SeuratProject       3054         1217        388
#> AAACCTGAGCTACCGC SeuratProject       2411          989        337
#>                  nFeature_ADT nCount_HTO nFeature_HTO HTO_maxID
#> AAACCTGAGACAAAGG            5        358            3     HTO28
#> AAACCTGAGCTACCGC            8        180            2     HTO30
#>                  HTO_secondID HTO_margin HTO_classification
#> AAACCTGAGACAAAGG        HTO29   3.211027              HTO28
#> AAACCTGAGCTACCGC        HTO29   3.448779              HTO30
#>                  HTO_classification.global hash.ID percent_mito
#> AAACCTGAGACAAAGG                   Singlet   HTO28     7.105435
#> AAACCTGAGCTACCGC                   Singlet   HTO30     9.622563
#>                  sample
#> AAACCTGAGACAAAGG PBMC-1
#> AAACCTGAGCTACCGC PBMC-3


Process gene expression data

To review the basic Seurat processing workflow, see if you can complete the following steps for the gene expression data:

# Add your commands here
#
#
#
#
#
#
#
#
#
#


Clustering cells based on antibody signal

Normalize CITE-seq counts

Before clustering, we first want to normalize our CITE-seq data. Like cell hashing data, CITE-seq counts can be normalized by performing a centered log-ratio transformation.

# Set active.assay to ADT
# if you do not change the active.assay, you must specify the assay for
# each Seurat command
filt_so@active.assay <- "ADT"

# Normalize CITE-seq data
filt_so <-  filt_so %>%
  NormalizeData(
    assay                = "ADT",
    normalization.method = "CLR"
  ) %>%  
  ScaleData(assay = "ADT")

Cluster cells using antibody signal

Since there are only a few antibodies and not many dimensions, instead of performing PCA we can just use the scaled matrix for clustering directly. In contrast to the gene expression data, we have already done feature selection and dimensionality reduction by choosing (for very practical reasons) to only stain cells with a few antibodies that discriminate cell populations. However, if you stained cells with many (>100) antibodies, then you may want to select variable antibodies, and use PCA for clustering.

We will directly pass the ADT data to the FindNeighbors function, which creates a Shared Nearest Neighbor (SNN) graph that is used for clustering.

# Cluster cells
filt_so <- filt_so %>%
  FindNeighbors(
    assay     = "ADT",
    reduction = NULL,                        # we are not using a reduction
    dims      = NULL,                        # we do not specify dims
    features  = rownames(filt_so[["ADT"]])   # specify features to use
  ) %>%
  FindClusters(
    resolution = 1,
    graph.name = "ADT_snn"
  )

# For clarity store clusters in meta.data as 'ADT_clusters'
filt_so$ADT_clusters <- filt_so$ADT_snn_res.1

Run UMAP using antibody signal

# Run UMAP
filt_so <- filt_so %>%
  RunUMAP(
    assay          = "ADT",
    reduction      = NULL,
    dims           = NULL,
    features       = rownames(filt_so[["ADT"]]),
    reduction.name = "adt_umap",
    reduction.key  = "adtUMAP_"
  )

# Plot UMAPs
filt_so %>%
  DimPlot(
    reduction = "adt_umap", 
    group.by  = c("sample", "ADT_clusters"),
    ncol      = 2
  )

Identify marker proteins

# Identify differentially expressed proteins for each cluster
ADT_markers <- filt_so %>%
  FindAllMarkers(
    assay    = "ADT",
    only.pos = TRUE
  )

ADT_markers[-c(1, 7)]
#>             avg_log2FC pct.1 pct.2     p_val_adj cluster
#> adt-CD8      1.3552794 1.000 0.956 1.274197e-177       0
#> adt-CD3      0.3214778 0.957 0.853  4.922506e-34       0
#> adt-CD81     1.3678729 1.000 0.957 1.252873e-155       1
#> adt-CD4      2.2149898 1.000 0.920 7.418699e-173       2
#> adt-CD82     0.9782417 1.000 0.960  1.823298e-63       5
#> adt-CD41     2.0790827 1.000 0.923 3.236536e-126       6
#> adt-CD14     2.4677329 0.903 0.346 4.482614e-149       7
#> adt-CD45     1.1931465 0.946 0.611 1.129474e-105       7
#> adt-CD45RO   2.0208403 0.987 0.865  1.206924e-99       7
#> adt-CD19     0.5503366 0.993 0.913  2.780936e-76       7
#> adt-CD45RA   0.6816599 0.990 0.981  6.480328e-09       7
#> adt-CD42     1.4751528 1.000 0.923  2.275135e-77       8
#> adt-CD191    4.9623991 1.000 0.916  1.376132e-80       9
#> adt-CD45RA1  0.5971657 0.992 0.982  7.062440e-19       9
#> adt-CD45RO1  0.7362115 0.984 0.871  1.129237e-16       9
#> adt-CD43     1.4427406 1.000 0.929  8.271288e-06      10
#> adt-CD31     0.5441429 1.000 0.871  1.743421e-04      10
#> adt-CD83     0.4580862 1.000 0.963  7.012408e-02      10


Visualizing antibody signal

Overlay antibody signal on UMAPs

# CITE-seq features to plot
feats <- c(
  "adt-CD4", "rna_CD4",
  "adt-CD8", "rna_CD8A"
)

# Overlay antibody signal on gene expression UMAP
filt_so %>%
  FeaturePlot(
    reduction = "umap",
    features  = feats
  )
# Overlay antibody signal on antibody UMAP
filt_so %>%
  FeaturePlot(
    reduction = "adt_umap",
    features  = feats
  )

Ridge plots

# Create ridge plot
feats <- c(
  "adt-CD14", "adt-CD45",
  "adt-CD19", "adt-CD3",  
  "adt-CD4",  "adt-CD8"
)

filt_so %>%
  RidgePlot(features = feats)

Violin plots

# Create violin plots
feats <- c(
  "adt-CD4",  "rna_CD4",
  "adt-CD19", "rna_CD19"
)

filt_so %>%
  VlnPlot(
    features = feats,
    ncol     = 2
  )


Classifying cells based on antibody signal

We can use 2D scatter plots to derive “gates” to assign cell types in the same manner as done for flow cytometry analysis. Generally the CITE-seq signal has lower dynamic range and signal-to-noise compared to flow cytometry, but the overall patterns seen with flow cytometry are also seen with CITE-seq.

Identify CD19+ cells

CD19 is a marker of B cells, whereas CD3 is a marker of T cells. We can compare their expression to classify B-cells.

# Plot CD3 and CD19 signal
CD19_plot <- filt_so %>%
  FeatureScatter("adt-CD3", "adt-CD19")

CD19_plot

# Identify CD19+ cells using antibody signal
CD19_cells <- filt_so %>%
  subset(`adt-CD19` > 2.5 & `adt-CD3` < 1) %>%
  Cells()

head(CD19_cells)
#> [1] "AAACCTGAGCTACCGC" "AAAGCAATCTGTCCGT" "AACACGTGTCGCTTCT"
#> [4] "AACCATGGTCTGGTCG" "AACGTTGTCAATCACG" "AACTTTCCAGCCTTGG"
# Set cell identities
labeled_so <- filt_so %>%
  SetIdent(value = "Other") %>%
  SetIdent(value = "CD19+", cells = CD19_cells)

# Add cell identities to meta.data
labeled_so$CD19_class <- Idents(labeled_so)

# Label UMAP with new cell identities
labeled_so %>%
  DimPlot(reduction = "adt_umap")

Label cells using CellSelector

The CellSelector function allows you to interactively select cells from a scatter plot.

# Identify CD19+ cells using antibody signal
labeled_so <- filt_so %>%
  SetIdent(value = "Other")

labeled_so <- CellSelector(
  plot   = CD19_plot,
  object = labeled_so,
  ident  = "B cells"
)

# Label UMAP with new cell identities
labeled_so %>%
  DimPlot(reduction = "adt_umap")

Identify CD4+ and CD8+ cells

Identify CD4+ and CD8+ cells using one of the methods discussed above.

# Add your commands here
#
#
#
#
#
#
#
#
#
#


Examine CITE-seq derived cell labels on UMAPs

rna_umap <- labeled_so %>%
  DimPlot(reduction = "umap")

adt_umap <- labeled_so %>%
  DimPlot(reduction = "adt_umap")

plot_grid(rna_umap, adt_umap)


Combining CITE-seq and gene expression data

The authors of Seurat have implemented a new algorithm for combined data from different types of assays (e.g. antibody data and RNA data), described here and presented in a vignette.

The basic idea is that we want to cluster and visualize the data using both the RNA and antibody data. How to do this?

We could literally just merge the two datasets together then run them through Seurat. This will however end up downweighting the contribution of the antibody data because there are so few antibodies (8) compared to RNA features (~10,000)

Seurat has implemented a “Weighted Nearest Neighbor” approach that will combine dimensional reductions from the RNA data with the antibody data. The algorithm will calculate relative weights for the RNA and antibody data for each cell and use these new weights to construct a shared nearest neighbor graph. The related weights are calculated based on the relative information content of neighboring cells in each modality. So if the data from one modality is better at distinguishing the cells, it is weighted higher.

# Need to run PCA on the antibody data
filt_so <- filt_so %>%
  RunPCA(
    features       = rownames(filt_so[["ADT"]]),
    reduction.name = "adt_pca",
    reduction.key  = "adtPC_",
    assay          = "ADT",
    maxit          = 4000
  )

filt_so <- filt_so %>%
  FindMultiModalNeighbors(
    reduction.list       = list("pca", "adt_pca"),
    dims.list            = list(1:20, 1:6),
    modality.weight.name = c("RNA.weight", "ADT.weight")
  )

filt_so <- filt_so %>%
  RunUMAP(
    nn.name        = "weighted.nn",
    reduction.name = "wnn.umap", 
    reduction.key  = "wnnUMAP_"
  )
rna_umap  <- DimPlot(filt_so, reduction = "umap",     group.by = "ADT_clusters")
adt_umap  <- DimPlot(filt_so, reduction = "adt_umap", group.by = "ADT_clusters")
both_umap <- DimPlot(filt_so, reduction = "wnn.umap", group.by = "ADT_clusters")

plot_grid(
  rna_umap, adt_umap, both_umap,
  labels = c("RNA", "ADT", "Both"),
  nrow   = 1
)


Viewing results with the UCSC Cell Browser

The UCSC Cell Browser allows you to easily explore and share single-cell data. With the Cell Browser you can:

UCSC hosts a large collection of cellbrowsers for various single cell datasets.

Merge gene expression and antibody matrices

We won’t be generating these browsers in the class, however below is an example of how to build a browser. The first step is to combine the gene expression and antibody data into a single matrix.

# Combine RNA and ADT matrices
merged_so   <- filt_so
RNA_data    <- merged_so[["RNA"]]@data
ADT_data    <- merged_so[["ADT"]]@data
merged_data <- rbind(RNA_data, ADT_data)

# Add merged matrix to Seurat object
merged_so[["RNA"]]@data <- merged_data

# Set active assay
merged_so@active.assay <- "RNA"

Create Cell Browser files

To create the files needed for the session, we can use the make_cellbrowser function from the scbp package written by Kent Riemondy. This function extracts the required data from our Seurat object and generates configuration files needed to build the session. To include both the gene expression and antibody UMAPs, we generate separate sets of files for each.

# Create Cell Browser directories for gene expression data
dir.create(
  path      = "cellbrowser",
  recursive = TRUE
)

# meta.data fields to include in browser
feats <- c(
  "Sample"      = "sample",
  "clusters"    = "RNA_clusters",
  "ADT cluster" = "ADT_clusters",
  "Cell label"  = "cell_label",
  
  "RNA UMI count"     = "nCount_RNA",
  "Gene count"        = "nFeature_RNA",
  "Percent mito UMIs" = "percent_mito",
  "ADT UMI count"     = "nCount_ADT",
  "Antibody count"    = "nFeature_ADT",
  "HTO UMI count"     = "nCount_HTO",
  "HTO count"         = "nFeature_HTO"
)

# Create Cell Browser files for gene expression data
merged_so %>%
  make_cellbrowser(
    outdir      = "cellbrowser",
    column_list = feats,
    embeddings  = "umap",
    project     = "RNA"
  )
# Create Cell Browser files for antibody data
merged_so %>%
  make_cellbrowser(
    outdir      = "cellbrowser",
    column_list = feats,
    embeddings  = "adt_umap",
    project     = "ADT"
  )

Build Cell Browser session

To build the cell browser session, install cbBuild. This tool will build the session using the configuration files generated in the previous step.

mkdir -p cellbrowser/browser

cbBuild \
  -i cellbrowser/RNA/cellbrowser.conf \
  -i cellbrowser/ADT/cellbrowser.conf \
  -o cellbrowser/browser \
  -p 8888