Single cell RNA-seq class I

Author

Kent Riemondy

Published

October 21, 2024

Overview

Today we will begin a two-course segment on single cell RNA-seq analysis. We will begin with an introduction to single cell RNA-seq data, followed by a discussion of key quality control metrics used to exclude low-quality cells from the data.

Analyzing single cell data in R

Droplet based scRNA-seq will produce libraries with 100,000 - 1,000,000 cell barcodes in each experiment, despite only 1-10,000 cells being loaded. Many of these barcodes are generated by non-cell containing empty droplets, which will end up with a few reads.

Storing a matrix of 20,000 genes x 1,000,000 cell barcodes would require a large amount of memory (20 billion values). However most of these values (> 95%) are zeros due to many empty barcodes with few UMIs and the low efficiency of single cell library generation (< 10-20% efficient). To limit unnecessary memory usage single cell data matrices are stored in a sparseMatrix format.

# typical dense matrix format
vals <- c(
  0, 0, 0, 2, 0,
  0, 1, 0, 0, 0,
  0, 0, 0, 0, 0,
  0, 0, 0, 0, 1,
  0, 0, 0, 0, 0
)

m <- matrix(vals, nrow = 5)
m

     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    0    0    0
[2,]    0    1    0    0    0
[3,]    0    0    0    0    0
[4,]    2    0    0    0    0
[5,]    0    0    0    1    0

Converting a dense matrix to a sparse matrix can be accomplished using the generic as function (which is a common way to convert formats).

sm <- as(m, "sparseMatrix")
# alternatively
# Matrix(vals, nrow = 5)
sm

5 x 5 sparse Matrix of class "dtCMatrix"
              
[1,] . . . . .
[2,] . 1 . . .
[3,] . . . . .
[4,] 2 . . . .
[5,] . . . 1 .

The sparse matrix format only stores non-zero values in the matrix. Internally , and in text formats, these are stored as a row column value triplet.

summary(sm)

5 x 5 sparse Matrix of class "dtCMatrix", with 3 entries
  i j x
1 4 1 2
2 2 2 1
3 5 4 1

Many of the functions that manipulate matrices (e.g. rowMeans, colSums, apply, [) can also be used on sparseMatrices, provided that you load the Matrix package (e.g library(Matrix)).

How can we extract the first 2 rows and first 3 columns of the sparse matrix sm that we generated above?

# print subset of sm
sm[1:2, 1:3]

How can we calculate the sum of the columns of sm?

# find column sums of sparse
colSums(sm)

Base R subsetting

To work with single cell data it will be helpful to know some base R concepts in such as subsetting with brackets [ and referencing and generating columns with $.

Vectors in R can be subset by vector index (position), a logical vector (c(TRUE, FALSE)), or name (if the vector is named).

Consider the following character vector letters.

letters

 [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s"
[20] "t" "u" "v" "w" "x" "y" "z"

# extract 2nd, 4th, and 6th entry
letters[c(2, 4, 6)]

[1] "b" "d" "f"

# subset by creating logical vector
vowels <- c("a", "e", "i", "o", "u")
is_a_vowel <- letters %in% vowels

letters[is_a_vowel]

[1] "a" "e" "i" "o" "u"

# name the letters vector with uppercase LETTERS
names(letters) <- LETTERS

# subset by name
letters[c("A", "Z")]

  A   Z 
"a" "z"

Matrices are 2 dimensional vectors and have similar subsetting rules except there are two dimensions, rows and columns.

matrix[rows_to_subset, columns_to_subset]

m <- matrix(1:24, nrow = 6)

# extract 2nd, 4th, and 6th row
m[c(2, 4, 6), ]

     [,1] [,2] [,3] [,4]
[1,]    2    8   14   20
[2,]    4   10   16   22
[3,]    6   12   18   24

# extract 2nd and 4th column
m[, c(2, 4)]

     [,1] [,2]
[1,]    7   19
[2,]    8   20
[3,]    9   21
[4,]   10   22
[5,]   11   23
[6,]   12   24

# first 3 rows and 2nd and 4th column
m[1:3, c(2, 4)]

     [,1] [,2]
[1,]    7   19
[2,]    8   20
[3,]    9   21

# extract rows with totals > 50
m[rowSums(m) > 50, ]

     [,1] [,2] [,3] [,4]
[1,]    4   10   16   22
[2,]    5   11   17   23
[3,]    6   12   18   24

# extract columns with minimum values < 8
m[, colMins(m) < 8]

     [,1] [,2]
[1,]    1    7
[2,]    2    8
[3,]    3    9
[4,]    4   10
[5,]    5   11
[6,]    6   12

The base R data.frame and Bioconductor DataFrame can also be subset with the [ and we can reference individual vectors in a data.frame using $.

# first 3 rows and columns of mtcars data.frame
mtcars[1:3, 1:3]

               mpg cyl disp
Mazda RX4     21.0   6  160
Mazda RX4 Wag 21.0   6  160
Datsun 710    22.8   4  108

# columns can be referenced using $, which extracts a vector
mtcars$mpg

 [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
[16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
[31] 15.0 21.4

# columns can be generated or overwritten using $ with assignment
mtcars$new_column_name <- "Hello!"
mtcars$wt <- mtcars$wt * 1000

# We can subset using logical vectors
# E.g. filter for rows (cars) with mpg > 20
mtcars[mtcars$mpg > 20, ]

                mpg cyl  disp  hp drat   wt  qsec vs am gear carb
Mazda RX4      21.0   6 160.0 110 3.90 2620 16.46  0  1    4    4
Mazda RX4 Wag  21.0   6 160.0 110 3.90 2875 17.02  0  1    4    4
Datsun 710     22.8   4 108.0  93 3.85 2320 18.61  1  1    4    1
Hornet 4 Drive 21.4   6 258.0 110 3.08 3215 19.44  1  0    3    1
Merc 240D      24.4   4 146.7  62 3.69 3190 20.00  1  0    4    2
Merc 230       22.8   4 140.8  95 3.92 3150 22.90  1  0    4    2
Fiat 128       32.4   4  78.7  66 4.08 2200 19.47  1  1    4    1
Honda Civic    30.4   4  75.7  52 4.93 1615 18.52  1  1    4    2
Toyota Corolla 33.9   4  71.1  65 4.22 1835 19.90  1  1    4    1
Toyota Corona  21.5   4 120.1  97 3.70 2465 20.01  1  0    3    1
Fiat X1-9      27.3   4  79.0  66 4.08 1935 18.90  1  1    4    1
Porsche 914-2  26.0   4 120.3  91 4.43 2140 16.70  0  1    5    2
Lotus Europa   30.4   4  95.1 113 3.77 1513 16.90  1  1    5    2
Volvo 142E     21.4   4 121.0 109 4.11 2780 18.60  1  1    4    2
               new_column_name
Mazda RX4               Hello!
Mazda RX4 Wag           Hello!
Datsun 710              Hello!
Hornet 4 Drive          Hello!
Merc 240D               Hello!
Merc 230                Hello!
Fiat 128                Hello!
Honda Civic             Hello!
Toyota Corolla          Hello!
Toyota Corona           Hello!
Fiat X1-9               Hello!
Porsche 914-2           Hello!
Lotus Europa            Hello!
Volvo 142E              Hello!

Reading alevin output into R with tximport

Our old-friend, the tximport package, has methods for importing the binary data from alevin. We need to supply a path to the quants_mat.gz file. Note that in contrast to importing data from salmon, tximport only allows 1 file to be loaded. If you want to load multiple samples use iteration approaches (e.g. lapply, purrr::map, a for loop). Also note that the eds package was installed which greatly speeds up the loading of the matrix.

We will load in data from a 10x Genomics scRNA-seq library generated from human periperhal blood mononuclear cells (PMBCS).

library(tximport)
tx <- tximport(
  here("data/block-rna/scrna/pbmc/alevin/quants_mat.gz"),
  type = "alevin"
)

reading in alevin gene-level counts across cells with 'eds'

names(tx)

[1] "abundance"           "counts"              "countsFromAbundance"

tx is a list with 3 elements, abundance, counts, and countsFromAbundance. Let’s look at the counts element

mat <- tx$counts
mat[5:10, 1:3]

6 x 3 sparse Matrix of class "dgCMatrix"
                GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT
ENSG00000243485                .                .                .
ENSG00000284332                .                .                .
ENSG00000237613                .                .                .
ENSG00000268020                .                .                .
ENSG00000290826                .                .                .
ENSG00000240361                .                .                .

Here you can see that tx$counts is a sparse matrix that is genes (rows) by cells (columns).

How many barcodes are in tx$counts? How many genes?

# TODO Find number of barcodes and genes in tx$counts
dim(mat)

What fraction of the matrix is non-zero? We can use the nnzerofunction from the Matrix package check

nnzero(mat) / length(mat) # (length = # of rows X # of columns)
# similarily
sum(mat > 0) / length(mat)

The SingleCellExperiment class

Single cell analysis is centered around the SingleCellExperiment data structure, which serves as a single container to store the input data, various transformations (count data, normalized count data, PCA, UMAPs, etc.) and any results that are generated.

Creating a SingleCellExperiment object

A SingleCellExperiment object can be created from our sparse matrix using the SingleCellExperiment() function.

sce <- SingleCellExperiment(list(counts = mat))
sce

class: SingleCellExperiment 
dim: 62266 6075 
metadata(0):
assays(1): counts
rownames(62266): ENSG00000290825 ENSG00000223972 ... ENSG00000210195
  ENSG00000210196
rowData names(0):
colnames(6075): GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ... ACGTAGGGTGACAGCA
  TCTCAGCTCGCCGAAC
colData names(0):
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

The SingleCellExperiment object stores the gene x cell count matrix within assays().

# get list of assays
assays(sce)

List of length 1
names(1): counts

# extract single assay
assay(sce, "counts")[1:4, 1:4]

4 x 4 sparse Matrix of class "dgCMatrix"
                GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT
ENSG00000290825                .                .                .
ENSG00000223972                .                .                .
ENSG00000227232                .                .                .
ENSG00000278267                .                .                .
                TATCTGTAGGTGATAT
ENSG00000290825                .
ENSG00000223972                .
ENSG00000227232                .
ENSG00000278267                .

assays(sce)$counts[1:4, 1:4]

4 x 4 sparse Matrix of class "dgCMatrix"
                GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT
ENSG00000290825                .                .                .
ENSG00000223972                .                .                .
ENSG00000227232                .                .                .
ENSG00000278267                .                .                .
                TATCTGTAGGTGATAT
ENSG00000290825                .
ENSG00000223972                .
ENSG00000227232                .
ENSG00000278267                .

# convenience function for counts assay
counts(sce)[1:4, 1:4]

4 x 4 sparse Matrix of class "dgCMatrix"
                GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ATTTCTGTCTCTATGT
ENSG00000290825                .                .                .
ENSG00000223972                .                .                .
ENSG00000227232                .                .                .
ENSG00000278267                .                .                .
                TATCTGTAGGTGATAT
ENSG00000290825                .
ENSG00000223972                .
ENSG00000227232                .
ENSG00000278267                .

Accessing and storing cell and gene level metadata

As we perform analyses we will accumulate cell-level information, such as quality control metrics, clustering results, and celltype assignments. The SingleCellExperiment class stores this data within a data frame called colData, which we can access using colData(). This is a specialized Bioconductor specific data.frame class (DataFrame) which has similar semantics and functionality to a base R data.frame.

# empty right now.
colData(sce)

DataFrame with 6075 rows and 0 columns

# add a sample annotation
colData(sce)$cell_source <- "PBMC"

# equivalent approach using $
sce$cell_source <- "PBMC"

colData(sce)

DataFrame with 6075 rows and 1 column
                 cell_source
                 <character>
GCTGCAGTCCGATCTC        PBMC
ACTATGGAGGTCCCTG        PBMC
ATTTCTGTCTCTATGT        PBMC
TATCTGTAGGTGATAT        PBMC
AGCCAGCCAAAGCACG        PBMC
...                      ...
CATCCCAAGTACTCGT        PBMC
CTCCTCCCATGAAGCG        PBMC
AGTTCCCCATGTCAGT        PBMC
ACGTAGGGTGACAGCA        PBMC
TCTCAGCTCGCCGAAC        PBMC

The SingleCellExperiment also stores gene-level metadata in a data.frame called rowData(). We will use the rowData to store gene ids, symbols, and other information about genes.

# empty right now
rowData(sce)

DataFrame with 62266 rows and 0 columns

Manipulating and subsetting a SingleCellExperiment

To get familiar with SingleCellExperiment objects let’s calculate the total number of counts in each cell and store these counts in the `colData().

sce$total_counts <- colSums(counts(sce))

Next let’s calculate the total number of counts for each gene, summed across cells and the number of cells with > 0 counts per gene, and store both of these values in the rowData().

rowData(sce)$total_gene_counts <- rowSums(counts(sce))
rowData(sce)$n_cells_expr <- rowSums(counts(sce) > 0)
rowData(sce)

DataFrame with 62266 rows and 2 columns
                total_gene_counts n_cells_expr
                        <numeric>    <integer>
ENSG00000290825         65.594607           43
ENSG00000223972          0.000000            0
ENSG00000227232          5.506349           24
ENSG00000278267          0.000000            0
ENSG00000243485          0.333333            1
...                           ...          ...
ENSG00000198695              2054         1509
ENSG00000210194                 0            0
ENSG00000198727            274421         5704
ENSG00000210195                 0            0
ENSG00000210196                 0            0

We can subset the SingleCellExperiment using the same syntax used to subset base R data.frames and matrices. Note that dplyr verbs do not work with SingleCellExperiment (see alternative tidySingleCellExperiment).

data.frame[rows, columns]

sce[genes, cells]

# subset to data from first 4 genes and cells
sce[1:4, 1:4]

# subset to cells from PBMC cells
sce[, sce$cell_source == "PBMC"]


genes_to_keep <- c("ENSG00000223972", "ENSG00000210195", "ENSG00000210196")
sce[genes_to_keep, ]

cells_to_keep <- c("ACTATGGAGGTCCCTG", "GCTGCAGTCCGATCTC", "TCTCAGCTCGCCGAAC")
sce[, cells_to_keep]

Many additional functions used on data.frames also work on SingleCellExperiment:

ncol(): # of cells
nrow(): # of gene
dims(): # of genes and cells
rownames(): rownames in matrices (e.g. genes)
colnames(): colnames in matrices (e.g. cells)
cbind(): combine multiple SingleCellExperiments by column
rbind(): combine multiple SingleCellExperiments by row

Storing gene identifier information

You’ll notice that our matrix has ensembl gene ids as the rownames (e.g. ENSG00000289576, ENSG00000221539). This is intentional as these identifiers are guaranteed to be unique and are a more stable and reliable identifier than gene symbols (e.g. ACTB, GAPDH). This becomes important if you want to compare to external datasets or ensure that your data can be easily used by others in the future.

These identifiers are useful but not as easy to interpret as gene symbols. Next we will retrieve gene symbols from an AnnotationHub() ensembldb resource and store the identifiers in the rowData().

ah <- AnnotationHub()
# download ensembl database
ens_db <- ah[["AH113665"]]

loading from cache

require("ensembldb")

gene_names <- mapIds(ens_db,
  keys = rownames(sce),
  keytype = "GENEID",
  column = "SYMBOL"
)

rowData(sce)$gene <- gene_names
rowData(sce)$gene_id <- rownames(sce)
rowData(sce)

DataFrame with 62266 rows and 4 columns
                total_gene_counts n_cells_expr        gene         gene_id
                        <numeric>    <integer> <character>     <character>
ENSG00000290825         65.594607           43     DDX11L2 ENSG00000290825
ENSG00000223972          0.000000            0     DDX11L1 ENSG00000223972
ENSG00000227232          5.506349           24      WASH7P ENSG00000227232
ENSG00000278267          0.000000            0   MIR6859-1 ENSG00000278267
ENSG00000243485          0.333333            1 MIR1302-2HG ENSG00000243485
...                           ...          ...         ...             ...
ENSG00000198695              2054         1509      MT-ND6 ENSG00000198695
ENSG00000210194                 0            0       MT-TE ENSG00000210194
ENSG00000198727            274421         5704      MT-CYB ENSG00000198727
ENSG00000210195                 0            0       MT-TT ENSG00000210195
ENSG00000210196                 0            0       MT-TP ENSG00000210196

We’d like to set the rownames of the object to symbols, but some of these are , NA or duplicated which will cause problems. uniquifyFeatureNames() is a convenience function that will rename gene symbols that are NA or duplicated values to the ensembl ID or a combination of gene symbol and ensembl ID.

rownames(sce) <- uniquifyFeatureNames(
  rowData(sce)$gene_id,
  rowData(sce)$gene
)
head(rownames(sce))

[1] "DDX11L2_ENSG00000290825" "DDX11L1"                
[3] "WASH7P"                  "MIR6859-1"              
[5] "MIR1302-2HG"             "MIR1302-2"

Quality control and cell filtering

Now that we have our data in a SingleCellExperiment we will perform some filtering and quality control to remove low expression genes and poor quality cells.

Our SingleCellExperiment has 62266 genes in the matrix. Most of these are not expressed. We want to exclude these genes as they won’t provide any useful data for the analysis.

# exclude genes expressed in fewer than 10 cells (~ 1% of cells)
rowData(sce)$n_cells <- rowSums(counts(sce) > 0)
sce <- sce[rowData(sce)$n_cells >= 10, ]
sce

class: SingleCellExperiment 
dim: 20858 6075 
metadata(0):
assays(1): counts
rownames(20858): DDX11L2_ENSG00000290825 WASH7P ... MT-ND6 MT-CYB
rowData names(5): total_gene_counts n_cells_expr gene gene_id n_cells
colnames(6075): GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ... ACGTAGGGTGACAGCA
  TCTCAGCTCGCCGAAC
colData names(2): cell_source total_counts
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

To exclude low-quality cells we will use the following metrics:

Number of counts per cell barcode
Number of genes detected per barcode
The percentage of counts from mitochondrial genes per barcode

A low number of counts, a low number of detected genes, and a high percentage of mitochondrial counts suggests that the cell had a broken membrane and the cytoplasmic mRNA leaked out. Conversely, an abnormally high number of counts and detected genes could indicate the presence of a doublet. See publication for more info (Classification of low quality cells from single-cell RNA-seq data)

To calculate these metrics we can use addPerCellQCMetrics from scater. Mitochondrial genes are named with a common “MT-” prefix (e.g. MT-CO2, MT-ATP6, MR-RNR2), which we can use to identify them.

# identify subset of genes that are from mitochondrial genome
is_mito <- startsWith(rowData(sce)$gene, "MT-")
sce <- addPerCellQCMetrics(sce, subsets = list(Mito = is_mito))
colData(sce)

DataFrame with 6075 rows and 8 columns
                 cell_source total_counts       sum  detected subsets_Mito_sum
                 <character>    <numeric> <numeric> <integer>        <numeric>
GCTGCAGTCCGATCTC        PBMC        31924   31908.0      5718          2668.49
ACTATGGAGGTCCCTG        PBMC        35845   35801.5      6073          3963.23
ATTTCTGTCTCTATGT        PBMC        31788   31758.3      6377          2496.33
TATCTGTAGGTGATAT        PBMC        32025   31998.0      6260          2961.15
AGCCAGCCAAAGCACG        PBMC        29882   29856.0      5746          2660.44
...                      ...          ...       ...       ...              ...
CATCCCAAGTACTCGT        PBMC          151   151.000       132               12
CTCCTCCCATGAAGCG        PBMC          133   132.500       131                6
AGTTCCCCATGTCAGT        PBMC          173   172.667       166                8
ACGTAGGGTGACAGCA        PBMC          255   253.000       217               13
TCTCAGCTCGCCGAAC        PBMC          144   144.000       149                4
                 subsets_Mito_detected subsets_Mito_percent     total
                             <integer>            <numeric> <numeric>
GCTGCAGTCCGATCTC                    13              8.36307   31908.0
ACTATGGAGGTCCCTG                    15             11.07002   35801.5
ATTTCTGTCTCTATGT                    15              7.86038   31758.3
TATCTGTAGGTGATAT                    15              9.25417   31998.0
AGCCAGCCAAAGCACG                    15              8.91092   29856.0
...                                ...                  ...       ...
CATCCCAAGTACTCGT                     5              7.94702   151.000
CTCCTCCCATGAAGCG                     5              4.52830   132.500
AGTTCCCCATGTCAGT                     6              4.63320   172.667
ACGTAGGGTGACAGCA                     6              5.13834   253.000
TCTCAGCTCGCCGAAC                     2              2.77778   144.000

We can use the plotColData() function from scater to plot various metrics (as a ggplot2 object).

plotColData(sce, y = "sum")

plotColData(sce, y = "detected")

plotColData(sce, y = "detected", x = "sum", colour_by = "subsets_Mito_percent")

plotColData(sce, y = "subsets_Mito_percent", x = "sum") +
  labs(x = "# of counts")

Also note that you can easily extract the colData() as a data.frame to use with ggplot2 for custom plots. To extract additional features, e.g. per cell gene expression, you can use makePerCellDF() or makePerFeatureDF() from scater.

cell_info <- as.data.frame(colData(sce))
ggplot(cell_info, aes(sum, subsets_Mito_percent)) +
  geom_point()

Selecting an appropriate cutoff can be somewhat arbitrary, and there is a risk of excluding meaningful cell populations. I suggest starting with lenient cutoffs, then later increasing the stringency after examining the clustering and cell types.

Let’s define high quality cell as those with less than 20% counts from mitocondrial RNAs, greater than 500 genes detected, and greater than 1000 total counts. How many cells pass these criteria?

pass_qc <- sce$subsets_Mito_percent < 20 & sce$detected > 500 & sce$sum > 1000
sum(pass_qc)

[1] 4565

Labeling the qc failed cells in plots can be helpful.

sce$pass_qc <- sce$subsets_Mito_percent < 20 & sce$detected > 500 & sce$sum > 1000

plotColData(sce,
  y = "subsets_Mito_percent",
  x = "sum",
  colour_by = "pass_qc"
) +
  labs(x = "# of counts")

Lastly we can subset the SingleCellExperiment to exclude the low-quality cells.

sce <- sce[, sce$pass_qc]
sce

class: SingleCellExperiment 
dim: 20858 4565 
metadata(0):
assays(1): counts
rownames(20858): DDX11L2_ENSG00000290825 WASH7P ... MT-ND6 MT-CYB
rowData names(5): total_gene_counts n_cells_expr gene gene_id n_cells
colnames(4565): GCTGCAGTCCGATCTC ACTATGGAGGTCCCTG ... CTGGATCCACCGTACG
  TACAACGGTCTCGCGC
colData names(9): cell_source total_counts ... total pass_qc
reducedDimNames(0):
mainExpName: NULL
altExpNames(0):

It is often a good idea to save the SingleCellExperiment at periodic steps in the analysis. Use saveRDS() to store the object as a file, which can later be imported back into R using readRDS().

saveRDS(sce, "path/to/a/file.rds")

Analysis steps revisited

Normalization

Normalization attempts to correct for technical biases that will distort biological signal in the data. A large source of variation arises due to differences in sequencing depth between cells. This can be seen by performing PCA on the unnormalized counts. We will use runPCA from scater to perform PCA.

# set seed for functions with a randomized component
# to obtain the same result each execution
set.seed(20231023)
sce <- runPCA(sce, exprs_values = "counts", name = "count_PCA")

plotReducedDim(sce, "count_PCA", colour_by = "sum")

Note that PC1 is correlated with the total UMI counts (sum), meaning that the largest source of variation is related to differences in sequencing depth rather than biological differences between cells.

plot_df <- makePerCellDF(sce, c("count_PCA", "sum"))

ggplot(plot_df, aes(count_PCA.1, sum)) +
  geom_point()

To normalize we will use 3 functions from the scran package. The normalization entails crude clustering to group related cells, identifying a cell-specific normalization factor (size factor), then scaling the counts by this factor and log transforming the data (with a pseudocount). The algorithm is described in detail in the following paper and in the R documentation (?scuttle::computePooledFactors)

Lun AT, Bach K, Marioni JC. Pooling across cells to normalize single-cell RNA sequencing data with many zero counts. Genome Biol. 2016 Apr 27;17:75. doi: 10.1186/s13059-016-0947-7. PMID: 27122128; PMCID: PMC4848819.

set.seed(20231023)
clusters <- quickCluster(sce)
sce <- computeSumFactors(sce, clusters = clusters)
sce <- logNormCounts(sce)

set.seed(20231023)
sce <- runPCA(sce, exprs_values = "logcounts", name = "PCA")

plotReducedDim(sce, "PCA", colour_by = "sum")

plot_df <- makePerCellDF(sce, c("PCA", "sum"))
ggplot(plot_df, aes(PCA.1, sum)) +
  geom_point()

We can now see that PC1 does not correlate with total UMI counts and now shows some more distinct groups of cells, which will likely be different cell populations.