Some comments on using clustifyr with other scRNA-seq analysis-augmenting packages

Suitable situations

clustifyr is designed to be compatible with most workflows that find distinct cell clusters, even if the cells are overclustered (provided each cluster still contains at least 15 cells). However, if the data is continuous (such as developmental transitions), other approaches may be better suited.

Feature selection

Using all genes works moderately well, but in various testing scenarios performance improves with some form of feature selection, such as var.genes in Seurat, M3Drop, or simply the most highly variable genes. In general, we see satisfactory results with 500-1000 variable genes, although this number depends on the biology of the target dataset. Note that Seurat v3 now stores 2000 variable genes by default, which may be too many.
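As a rough illustration of the "simply highly variable genes" option, the sketch below ranks genes by plain variance across cells and keeps the top n. It is a language-agnostic sketch written in Python, not clustifyr, Seurat, or M3Drop code. The function name top_variable_genes and the toy data are hypothetical, and real tools usually model the mean-variance relationship rather than using raw variance.

```python
from statistics import pvariance


def top_variable_genes(expr, n):
    """Return the n gene names with the highest variance across cells.

    expr maps a gene name to its list of expression values, one per cell.
    Plain population variance is used here as a simplifying assumption.
    """
    ranked = sorted(expr, key=lambda gene: pvariance(expr[gene]), reverse=True)
    return ranked[:n]


# Toy expression values (hypothetical): three genes measured in four cells.
expr = {
    "GeneA": [0.0, 5.0, 0.0, 6.0],  # strongly variable
    "GeneB": [1.0, 1.0, 1.0, 1.0],  # constant, carries no cluster signal
    "GeneC": [2.0, 3.0, 2.0, 3.0],  # mildly variable
}

selected = top_variable_genes(expr, 2)
print(selected)
```

With a real dataset, n would be chosen in the 500-1000 range discussed above and adjusted to the biology at hand.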

Imputation

Imputation methods such as ALRA, which attempt to fill in drop-out signal, are not necessary for cluster-level identity assignment, although per-cell assignment is somewhat improved by imputation. Alternatively, clustifyr offers the rm0 option, which treats genes with 0 counts as missing rather than lowly expressed and ignores them.
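The idea behind treating zeros as missing can be sketched as follows: genes with a 0 count in the query are dropped before the query is correlated with a reference profile. This is an illustrative sketch only. It uses Pearson correlation for simplicity, the helper names are hypothetical, and it is not a description of how clustifyr implements rm0 internally.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length lists (undefined if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def pearson_ignoring_zeros(query, reference):
    """Correlate query and reference using only genes with a nonzero query value."""
    kept = [(q, r) for q, r in zip(query, reference) if q != 0]
    if len(kept) < 2:
        raise ValueError("need at least two nonzero query genes")
    return pearson([q for q, _ in kept], [r for _, r in kept])


# Toy per-gene values (hypothetical): two query genes are zero, possibly drop-outs.
query = [0.0, 2.0, 4.0, 0.0, 6.0]
reference = [5.0, 1.0, 2.0, 4.0, 3.0]

print(pearson(query, reference))                 # zeros treated as low expression
print(pearson_ignoring_zeros(query, reference))  # zeros treated as missing
```

In this toy case the two zero genes dominate the plain correlation, while the remaining genes agree closely with the reference once the zeros are ignored.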

Variance stabilizing transformation

clustifyr operates on raw counts and log-transformed data. A VST, such as the one implemented by sctransform, is acceptable but does not appear to give additional benefit. If one is used, the query and reference matrices should ideally both be transformed.

Background signal removal

Massive cell death can lead to RNA contamination of all sequenced cells. It is therefore important to assess background contamination and to mitigate it if needed. One simple way is to build a background reference by averaging the expression of all filtered-out cell IDs.
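The averaging approach can be sketched as below, assuming counts are available per gene and per cell ID and that the set of cell IDs that passed filtering is known. The data layout, the names average_profile and kept_cells, and the toy counts are all hypothetical; this is not a clustifyr function.

```python
def average_profile(counts, cell_ids):
    """Return the mean count per gene across the given cell IDs.

    counts maps gene name -> {cell ID: count}; missing entries count as 0.
    """
    if not cell_ids:
        raise ValueError("no cell IDs to average over")
    return {
        gene: sum(per_cell.get(cell, 0) for cell in cell_ids) / len(cell_ids)
        for gene, per_cell in counts.items()
    }


# Toy counts (hypothetical): two real cells and two filtered-out barcodes.
counts = {
    "GeneA": {"cell1": 10, "cell2": 12, "empty1": 2, "empty2": 4},
    "GeneB": {"cell1": 0, "cell2": 1, "empty1": 1, "empty2": 1},
}
kept_cells = {"cell1", "cell2"}  # cell IDs that passed quality filtering

# Every cell ID seen in the data, minus the ones that were kept.
all_cells = set()
for per_cell in counts.values():
    all_cells.update(per_cell)
filtered_out = sorted(all_cells - kept_cells)

background = average_profile(counts, filtered_out)
print(background)
```

The resulting profile could then be treated as one more reference column, so that clusters resembling it can be flagged as likely background.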

Marker gene list selection

Marker genes generated by Seurat and other methods can be converted to gene list matrix form for clustify_list. In contrast, clustify_nudge works best with normal feature selection plus a short, non-overlapping list of markers. Both approaches are handled by settings in matrixize_markers.
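To make the two shapes concrete, the sketch below turns per-cluster marker lists into a binary gene-by-cluster table and, separately, trims them to a short list of markers unique to each cluster. Both helpers are hypothetical illustrations. The binary layout is an assumption and is not claimed to match what matrixize_markers produces.

```python
from collections import Counter


def markers_to_binary_matrix(markers):
    """Return (cluster names, {gene: [0/1 per cluster]}) from per-cluster marker lists."""
    clusters = sorted(markers)
    genes = sorted({gene for genes_ in markers.values() for gene in genes_})
    rows = {g: [1 if g in markers[c] else 0 for c in clusters] for g in genes}
    return clusters, rows


def non_overlapping(markers, per_cluster):
    """Keep only markers unique to one cluster, at most per_cluster each."""
    seen = Counter(g for genes_ in markers.values() for g in set(genes_))
    return {
        cluster: [g for g in genes_ if seen[g] == 1][:per_cluster]
        for cluster, genes_ in markers.items()
    }


# Toy marker lists (hypothetical), ordered from strongest to weakest.
markers = {
    "B": ["CD79A", "MS4A1", "RPL13"],
    "T": ["CD3D", "CD3E", "RPL13"],
}

clusters, matrix = markers_to_binary_matrix(markers)
short_lists = non_overlapping(markers, 2)
print(clusters, matrix)
print(short_lists)
```

Here RPL13 is shared by both clusters, so it appears in both columns of the binary table but is dropped from the short non-overlapping lists.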

On a related note, the significance of RPL or RPS genes appearing as markers is often unclear, possibly due to normalization issues. For such cases, matrixize_markers offers a remove_rp option. This and similar filtering can of course be done manually as well.
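Doing this manually might look like the sketch below, which drops gene symbols beginning with RPL or RPS. The pattern is an assumption chosen for illustration and is not taken from the remove_rp implementation. It is deliberately crude: a prefix match also removes non-ribosomal genes such as RPS6KA1, so the result deserves a quick manual check.

```python
import re

# Assumed pattern: symbols starting with RPL or RPS, case-insensitive so that
# mouse-style symbols such as Rpl32 are matched as well.
RP_PATTERN = re.compile(r"^RP[LS]", re.IGNORECASE)


def drop_ribosomal(genes):
    """Return the gene symbols that do not start with RPL or RPS."""
    return [gene for gene in genes if not RP_PATTERN.match(gene)]


# Toy marker list (hypothetical).
genes = ["CD3D", "RPL13", "RPS6", "Rpl32", "MS4A1", "RPA1"]
filtered = drop_ribosomal(genes)
print(filtered)
```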