Guidelines for datasets to work with in the workshop, information on where to get public datasets, and examples of how to load different data formats into R.
We ask that each workshop participant selects a dataset to analyze while taking the workshop. Each lecture will use standardized datasets, however we will set aside time for attendees to discuss analysis of their datasets with the instructors. Working through your own dataset (or a relevant published dataset) will reinforce concepts taught in class.
In this article we will discuss:
- Guidelines and suggestions for the format of the dataset that you will analyze
- Various data repositories and other sources of public datasets
- Show examples of how to load various data formats into R for analysis with Seurat
The data that we will work with in the workshop will be count matrices. Count matrices are generally genes as rows and cells as columns, populated with read or UMI counts. These matrices are generated from pipelines such Cellranger (from 10x Genomics), or tools from academic labs such as Alevin (from the Patro lab), or Kallisto/Bustools (from the Pacther lab). These pipelines will align the raw FASTQ sequencing files, identify the barcodes associated with cell containing droplets, and output count matrices in various formats. Efficiently processing single cell datasets into count matrices generally requires more memory (RAM) and CPU power than is present on most laptops so we will not perform these steps in class. These steps are usually run on large compute clusters or on servers in the cloud. 10x Genomics offers a cloud service for running cellranger and the RBI has a pipeline for running cellranger on AWS.
As single cell datasets are continually growing in size (see article), so are the memory resources required for analyzing these datasets. A smaller dataset (e.g. ~5k cells) can be analyzed on a laptop with 8Gb of RAM without demanding too much memory. However, a dataset of ~50K cells generally maxes out the memory on a 2015 macbook pro with 16Gb of RAM.
To start analyzing single cell data, it is useful to learn the basics of the analysis working with 1 sample. However, 1 sample provides limited information, and is generally an insufficient dataset for learning new biology. Your dataset can therefore contain multiple samples, however this will increase the complexity of the analysis, particularly until we discuss methods for working with multiple samples in class 4.
If you do not have a dataset in mind to analyze there are many sources.
10x genomics 10x Genomics provides many datasets already processed through cellranger. You will need to register to gain access. The count matrix files to download, which contain only cell-associated barcodes, are called Feature / cell matrix (filtered)
.
UCSC cellbrowser: This is collection of published datsets that have been already analyzed and placed into an interactive web browser. This is a nice resource as you can use to look at the data, and compare your own analysis to this data. When you select a dataset you can click the Data Download
tab, and download the exprMatrix.tsv.gz
file. Some datasets also provide seurat objects for download as well (e.g. human lung airway dataset). Note that the data included in the matrices is often already normalized, and not integer UMI counts. You will therefore want skip normalization in Seurat if you use these datasets.
Gene Expression Omnibus. Published single cell RNA-seq experiments should have their raw data deposited into GEO. Navigating GEO to find datasets is difficult and therefore it is better to first find a publication, then link to the publications dataset. A count matrix should be included as processed data with the submission, however not all datasets have these, and the data formats and naming conventions are not standardized. An example dataset from a mouse lung injury experiment is here, with the GSE113049_count_matrix.tsv.gz
being the relevant UMI count matrix.
Bioconductor package: scRNAseq A curated selection of single cell datasets have been organized into a bioconductor pacakge called scRNAseq
. These datasets are provided as SingleCellExperiment
objects, which is the bioconductor data structure used for storing and working with single cell datasets. These can be easily converted to and from other data structures, such as Seurat, as shown in the load data from the wild section.
library(scRNAseq)
listDatasets()
DataFrame with 60 rows and 5 columns
Reference Taxonomy Part Number
<character> <integer> <character> <integer>
1 @aztekin2019identifi.. 8355 tail 13199
2 @bach2017differentia.. 10090 mammary gland 25806
3 @bacher2020low 9606 T cells 104417
4 @baron2016singlecell 9606 pancreas 8569
5 @baron2016singlecell 10090 pancreas 1886
... ... ... ... ...
56 @zeisel2018molecular 10090 nervous system 160796
57 @zhao2020singlecell 9606 liver immune cells 68100
58 @zhong2018singlecell 9606 prefrontal cortex 2394
59 @zilionis2019singlec.. 9606 lung 173954
60 @zilionis2019singlec.. 10090 lung 17549
Call
<character>
1 AztekinTailData()
2 BachMammaryData()
3 BacherTCellData()
4 BaronPancreasData('h..
5 BaronPancreasData('m..
... ...
56 ZeiselNervousData()
57 ZhaoImmuneLiverData()
58 ZhongPrefrontalData()
59 ZilionisLungData()
60 ZilionisLungData('mo..
Please contact the instructors if you have any difficulties finding an appropriate dataset
Cellranger produces many output files. The files in the filtered_feature_bc_matrix
directory contain the count matrix for cell-associated barcodes in a special sparseMatrix format (matrix market format) that can be loaded into R using a few different packages.
# read into R as a sparseMatrix
mat <- Seurat::Read10X()
# create a seurat object from the sparseMatrix
CreateSeuratObject(mat)
# alternatively, read into R as a SingleCellExperiment object for use with bioconductor
DropletUtils::read10xCounts()
Some datasets provide a Seurat object as an .rds
file. Download this file if provided. If not, then the gene expression data will also be provided in a tsv file called exprMatrix.tsv.gz
.
# to download and load an .rds file
download.file("https://cells.ucsc.edu/mouse-dev-neocortex/seurat.rds", "seurat.rds")
seurat_object <- readRDS("seurat.rds")
seurat_object
# to download and read in a .tsv file
#download.file("https://cells.ucsc.edu/mouse-dev-neocortex/exprMatrix.tsv.gz", "data.tsv.gz")
# slow way
mat <- read.table("data.tsv.gz")
# faster way, requires the data.table R package
mat <- data.table::fread("data.tsv.gz", sep = "\t", data.table = FALSE)
# move column "gene" to rownames and remove
rownames(mat) <- mat[, 1]
mat[, 1] <- NULL
# convert to sparseMatrix to load into Seurat
mat <- as.sparse(mat)
CreateSeuratObject(mat)
Data in GEO has no standarized format, so you will need to adapt the approach based on the files provided. Generally we try to upload data in a .tsv
,.csv
or a sparseMatrix format.
To load a .tsv/.csv file from a GEO record you can use a similar approach as used for the .tsv
file from the UCSC browser.
# to download and read in a .tsv file
#download.file("https://ftp.ncbi.nlm.nih.gov/geo/series/GSE113nnn/GSE113049/suppl/GSE113049_count_matrix.tsv.gz", "data.tsv.gz")
# faster way, requires the data.table R package
mat <- data.table::fread("data.tsv.gz", sep = "\t", data.table = FALSE)
# move column "V1" to rownames and remove
rownames(mat) <- mat[, 1]
mat[, 1] <- NULL
# convert to sparseMatrix to load into Seurat
mat <- as.sparse(mat)
CreateSeuratObject(mat)
library(scRNAseq)
library(Seurat)
# select a dataset from listDatasets()
# assign to object to load into the rsession
sce <- ZhongPrefrontalData()
# convert to Seurat object from SingleCellExperiment
# sometimes this approach will error out.
# seurat_object <- as.Seurat(sce)
# alternatively just extract the raw UMI counts matrix
mat <- counts(sce)
CreateSeuratObject(mat)
An object of class Seurat
24153 features across 2394 samples within 1 assay
Active assay: RNA (24153 features, 0 variable features)
If the data was generated by the Alevin pipelines you can use the tximport
package to load the data into R. This process can be accelerated by also installing the fishpond
package. Alevin will generate a file called quants_mat.gz
which is a custom binary file with the counts matrix.
#pseudocode
files <- "path/to/alevin/quants_mat.gz"
txi <- tximport(files, type = "alevin")
mat <- as.sparse(txi$counts)
CreateSeuratObect(mat)