Working with your own single cell data

Guidelines for datasets to work with in the workshop, information on where to get public datasets, and examples of how to load different data formats into R.

Kent Riemondy https://github.com/kriemo (RNA Bioscience Initiative)https://medschool.cuanschutz.edu/rbi
10-20-2021

Guidelines for datasets

We ask that each workshop participant selects a dataset to analyze while taking the workshop. Each lecture will use standardized datasets, however we will set aside time for attendees to discuss analysis of their datasets with the instructors. Working through your own dataset (or a relevant published dataset) will reinforce concepts taught in class.

In this article we will discuss:
- Guidelines and suggestions for the format of the dataset that you will analyze
- Various data repositories and other sources of public datasets
- Show examples of how to load various data formats into R for analysis with Seurat

Dataset format

The data that we will work with in the workshop will be count matrices. Count matrices are generally genes as rows and cells as columns, populated with read or UMI counts. These matrices are generated from pipelines such Cellranger (from 10x Genomics), or tools from academic labs such as Alevin (from the Patro lab), or Kallisto/Bustools (from the Pacther lab). These pipelines will align the raw FASTQ sequencing files, identify the barcodes associated with cell containing droplets, and output count matrices in various formats. Efficiently processing single cell datasets into count matrices generally requires more memory (RAM) and CPU power than is present on most laptops so we will not perform these steps in class. These steps are usually run on large compute clusters or on servers in the cloud. 10x Genomics offers a cloud service for running cellranger and the RBI has a pipeline for running cellranger on AWS.

Dataset size and complexity

As single cell datasets are continually growing in size (see article), so are the memory resources required for analyzing these datasets. A smaller dataset (e.g. ~5k cells) can be analyzed on a laptop with 8Gb of RAM without demanding too much memory. However, a dataset of ~50K cells generally maxes out the memory on a 2015 macbook pro with 16Gb of RAM.

To start analyzing single cell data, it is useful to learn the basics of the analysis working with 1 sample. However, 1 sample provides limited information, and is generally an insufficient dataset for learning new biology. Your dataset can therefore contain multiple samples, however this will increase the complexity of the analysis, particularly until we discuss methods for working with multiple samples in class 4.

Identifying public datasets

If you do not have a dataset in mind to analyze there are many sources.

library(scRNAseq)
listDatasets()
DataFrame with 60 rows and 5 columns
                 Reference  Taxonomy               Part    Number
               <character> <integer>        <character> <integer>
1   @aztekin2019identifi..      8355               tail     13199
2   @bach2017differentia..     10090      mammary gland     25806
3           @bacher2020low      9606            T cells    104417
4     @baron2016singlecell      9606           pancreas      8569
5     @baron2016singlecell     10090           pancreas      1886
...                    ...       ...                ...       ...
56    @zeisel2018molecular     10090     nervous system    160796
57     @zhao2020singlecell      9606 liver immune cells     68100
58    @zhong2018singlecell      9606  prefrontal cortex      2394
59  @zilionis2019singlec..      9606               lung    173954
60  @zilionis2019singlec..     10090               lung     17549
                      Call
               <character>
1        AztekinTailData()
2        BachMammaryData()
3        BacherTCellData()
4   BaronPancreasData('h..
5   BaronPancreasData('m..
...                    ...
56     ZeiselNervousData()
57   ZhaoImmuneLiverData()
58   ZhongPrefrontalData()
59      ZilionisLungData()
60  ZilionisLungData('mo..

Please contact the instructors if you have any difficulties finding an appropriate dataset

Load data from the wild into R

Cellranger output

Cellranger produces many output files. The files in the filtered_feature_bc_matrix directory contain the count matrix for cell-associated barcodes in a special sparseMatrix format (matrix market format) that can be loaded into R using a few different packages.

# read into R as a sparseMatrix
mat <- Seurat::Read10X()
# create a seurat object from the sparseMatrix
CreateSeuratObject(mat)

# alternatively, read into R as a SingleCellExperiment object for use with bioconductor
DropletUtils::read10xCounts()

From UCSC cellbrowser

Some datasets provide a Seurat object as an .rds file. Download this file if provided. If not, then the gene expression data will also be provided in a tsv file called exprMatrix.tsv.gz.

# to download and load an .rds file
download.file("https://cells.ucsc.edu/mouse-dev-neocortex/seurat.rds", "seurat.rds")
seurat_object <- readRDS("seurat.rds")
seurat_object
# to download and read in a .tsv file
#download.file("https://cells.ucsc.edu/mouse-dev-neocortex/exprMatrix.tsv.gz", "data.tsv.gz")

# slow way
mat <- read.table("data.tsv.gz")
# faster way, requires the data.table R package
mat <- data.table::fread("data.tsv.gz", sep = "\t", data.table = FALSE)

# move column "gene" to rownames and remove 
rownames(mat) <- mat[, 1]
mat[, 1] <- NULL

# convert to sparseMatrix to load into Seurat
mat <- as.sparse(mat)

CreateSeuratObject(mat)

From GEO

Data in GEO has no standarized format, so you will need to adapt the approach based on the files provided. Generally we try to upload data in a .tsv,.csv or a sparseMatrix format.

To load a .tsv/.csv file from a GEO record you can use a similar approach as used for the .tsv file from the UCSC browser.

# to download and read in a .tsv file
#download.file("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE113nnn/GSE113049/suppl/GSE113049_count_matrix.tsv.gz", "data.tsv.gz")

# faster way, requires the data.table R package
mat <- data.table::fread("data.tsv.gz", sep = "\t", data.table = FALSE)

# move column "V1" to rownames and remove 
rownames(mat) <- mat[, 1]
mat[, 1] <- NULL

# convert to sparseMatrix to load into Seurat
mat <- as.sparse(mat)

CreateSeuratObject(mat)

From the scRNAseq datasets

library(scRNAseq)
library(Seurat)
# select a dataset from listDatasets()

# assign to object to load into the rsession
sce <- ZhongPrefrontalData()

# convert to Seurat object from SingleCellExperiment
# sometimes this approach will error out.
# seurat_object <- as.Seurat(sce)
# alternatively just extract the raw UMI counts matrix
mat <- counts(sce)
CreateSeuratObject(mat)
An object of class Seurat 
24153 features across 2394 samples within 1 assay 
Active assay: RNA (24153 features, 0 variable features)

From the Alevin pipeline

If the data was generated by the Alevin pipelines you can use the tximport package to load the data into R. This process can be accelerated by also installing the fishpond package. Alevin will generate a file called quants_mat.gz which is a custom binary file with the counts matrix.

#pseudocode
files <- "path/to/alevin/quants_mat.gz"
txi <- tximport(files, type = "alevin")
mat <- as.sparse(txi$counts)
CreateSeuratObect(mat)