This vignette provides detailed examples for quantifying differences in clonal abundance. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.
library(djvdj)
library(Seurat)
library(ggplot2)
library(patchwork) # provides `+` for combining ggplot objects (used for the UMAP panels)

# Add V(D)J data to object
vdj_dirs <- c(
  BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
  MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)

so <- splen_so |>
  import_vdj(vdj_dirs, define_clonotypes = "cdr3_gene")
Calculating clonal abundance
To quantify clonotype abundance and store the results in the object
meta.data, use the calc_frequency() function. This adds columns showing
the number of occurrences of each clonotype (‘freq’), the percentage of
cells sharing the clonotype (‘pct’), and a label that can be used for
plotting (‘grp’). Each column name is prefixed with the data_col name,
e.g. ‘clonotype_id_freq’. By default these calculations are performed
for all cells in the object.
so_vdj <- so |>
  calc_frequency(data_col = "clonotype_id")
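To make the added columns concrete, the following base R sketch performs the same kind of calculation on a small hypothetical data frame. The toy data, the unprefixed column names, and the choice of denominator are assumptions made for illustration. This is not the djvdj implementation, and calc_frequency() may treat cells lacking V(D)J data differently.

```r
# Toy meta.data: one row per cell; NA marks a cell without V(D)J data (hypothetical)
meta <- data.frame(
  cell         = paste0("cell", 1:6),
  clonotype_id = c("clonotype1", "clonotype1", "clonotype1",
                   "clonotype2", "clonotype3", NA)
)

# Count how many cells carry each clonotype (table() drops NA by default)
counts <- table(meta$clonotype_id)

# Map the per-clonotype counts back onto each cell; cells without a clonotype stay NA
meta$freq <- as.integer(counts)[match(meta$clonotype_id, names(counts))]

# Percentage of cells sharing each clonotype
# (assumption: the denominator excludes cells lacking V(D)J data)
meta$pct <- 100 * meta$freq / sum(counts)

meta
```

Here three of the five cells with V(D)J data share clonotype1, so those cells get a freq of 3 and a pct of 60.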
To calculate clonotype abundance separately for each sample or cluster,
set the cluster_col argument to the name of the meta.data column
containing the sample or cluster IDs for each cell.
so_vdj <- so |>
  calc_frequency(
    data_col = "clonotype_id",
    cluster_col = "sample"
  )
When cluster_col is specified, an additional meta.data column
(‘shared’) is added indicating whether the clonotype is shared between
multiple clusters.
so_vdj |>
  slot("meta.data") |>
  head(2)
#> orig.ident nCount_RNA nFeature_RNA RNA_snn_res.1
#> BL6_AAACGGGGTTCTGTTT-1 BL6 666 341 2
#> BL6_AAAGATGCAACAACCT-1 BL6 308 233 0
#> seurat_clusters UMAP_1 UMAP_2
#> BL6_AAACGGGGTTCTGTTT-1 2 -0.2850037 -2.036348
#> BL6_AAAGATGCAACAACCT-1 0 2.2518005 -1.472473
#> type r cell_type sample
#> BL6_AAACGGGGTTCTGTTT-1 B cells (B.CD19CONTROL) 0.4712686 B cells BL6-1
#> BL6_AAAGATGCAACAACCT-1 B cells (B.CD19CONTROL) 0.5435733 B cells BL6-1
#> exact_subclonotype_id chains n_chains cdr3
#> BL6_AAACGGGGTTCTGTTT-1 NA <NA> NA <NA>
#> BL6_AAAGATGCAACAACCT-1 1 IGK 1 CFQGSHVPWTF
#> cdr3_nt cdr3_length
#> BL6_AAACGGGGTTCTGTTT-1 <NA> <NA>
#> BL6_AAAGATGCAACAACCT-1 TGCTTTCAAGGTTCACATGTTCCGTGGACGTTC 11
#> cdr3_nt_length v_gene d_gene j_gene c_gene
#> BL6_AAACGGGGTTCTGTTT-1 <NA> <NA> <NA> <NA> <NA>
#> BL6_AAAGATGCAACAACCT-1 33 IGKV1-117 None IGKJ1 IGKC
#> isotype reads umis productive full_length paired
#> BL6_AAACGGGGTTCTGTTT-1 <NA> <NA> <NA> <NA> <NA> NA
#> BL6_AAAGATGCAACAACCT-1 None 352 21 TRUE TRUE FALSE
#> clonotype_id clonotype_id_freq clonotype_id_pct
#> BL6_AAACGGGGTTCTGTTT-1 <NA> NA NA
#> BL6_AAAGATGCAACAACCT-1 clonotype34 1 1.818182
#> clonotype_id_shared clonotype_id_grp
#> BL6_AAACGGGGTTCTGTTT-1 NA <NA>
#> BL6_AAAGATGCAACAACCT-1 TRUE 1
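The per-cluster columns and the ‘shared’ flag can be illustrated with a base R sketch on hypothetical data. The toy clonotypes and the unprefixed column names are assumptions, and the code only mirrors the idea of the calculation rather than how djvdj performs it.

```r
# Toy meta.data with two samples (hypothetical values)
meta <- data.frame(
  clonotype_id = c("clonotype1", "clonotype1", "clonotype2",
                   "clonotype1", "clonotype3", "clonotype3"),
  sample       = c("BL6-1", "BL6-1", "BL6-1", "MD4-1", "MD4-1", "MD4-1")
)
idx <- seq_len(nrow(meta))

# Number of cells carrying each clonotype within each sample
meta$freq <- ave(idx, meta$clonotype_id, meta$sample, FUN = length)

# Percentage relative to the number of cells in the same sample
meta$pct <- 100 * meta$freq / ave(idx, meta$sample, FUN = length)

# Flag a clonotype as shared when it is found in more than one sample
n_samples   <- tapply(meta$sample, meta$clonotype_id, function(x) length(unique(x)))
meta$shared <- as.vector(n_samples[meta$clonotype_id] > 1)

meta
```

In this toy example only clonotype1 appears in both samples, so it is the only clonotype flagged as shared, even for the single MD4-1 cell where its within-sample frequency is 1.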
Plotting clonal abundance
djvdj includes the plot_clonal_abundance() function to visualize
differences in clonotype frequency between samples or clusters. By
default this produces bar graphs. Plot colors can be adjusted using the
plot_colors argument.
so |>
  plot_clonal_abundance(
    clonotype_col = "clonotype_id",
    plot_colors = "#3182bd"
  )
Abundance values can be calculated and plotted separately for each
sample or cluster using the cluster_col argument. The panel_nrow
argument adjusts the number of rows used to arrange the plots, and the
panel_scales argument can be used to give each sample its own scale.
As expected, most MD4 B cells share the same clonotype, while BL6 cells have a diverse repertoire.
so |>
  plot_clonal_abundance(
    clonotype_col = "clonotype_id",
    cluster_col = "orig.ident",
    panel_scales = "free"
  )
Rank-abundance plots can also be generated by setting the method
argument to ‘line’. Most djvdj plotting functions return ggplot objects
that can be further modified with ggplot2 functions. Here we
log10-transform the y-axis using the ggplot2::scale_y_log10() function.
so |>
  plot_clonal_abundance(
    clonotype_col = "clonotype_id",
    cluster_col = "orig.ident",
    method = "line",
    plot_colors = c(MD4 = "#fec44f", BL6 = "#3182bd")
  ) +
  scale_y_log10()
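A rank-abundance curve plots each clonotype's abundance against its rank, with the most abundant clonotype at rank 1. The base R sketch below builds such a table from hypothetical counts. The values and column names are assumptions for illustration and do not reflect how plot_clonal_abundance() prepares its data internally.

```r
# Hypothetical clonotype counts for a single sample
freq <- c(clonotype1 = 3, clonotype2 = 40, clonotype3 = 1,
          clonotype4 = 5, clonotype5 = 1)

# Convert counts to percentages of all cells in the sample
ranked <- data.frame(
  clonotype_id = names(freq),
  pct          = 100 * unname(freq) / sum(freq)
)

# Order from most to least abundant and assign ranks
ranked      <- ranked[order(-ranked$pct), ]
ranked$rank <- seq_len(nrow(ranked))

ranked
```

Plotting pct against rank on a log10 y-axis gives the familiar rank-abundance shape: a sample dominated by one clone drops off steeply after rank 1, while a diverse repertoire yields a flatter curve.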
UMAP projections
By default calc_frequency() divides clonotypes into groups based on
abundance and adds a column to the meta.data containing these group
labels. Clonotype abundance can be summarized on a UMAP projection by
plotting the added ‘grp’ column using the generic plotting function
plot_features().
# Create UMAP summarizing samples
mouse_gg <- so |>
  plot_features(
    feature = "orig.ident",
    size = 0.25
  )

# Create UMAP summarizing clonotype abundance
abun_gg <- so |>
  calc_frequency(
    data_col = "clonotype_id",
    cluster_col = "sample"
  ) |>
  plot_features(
    feature = "clonotype_id_grp",
    size = 0.25
  )

mouse_gg + abun_gg
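Grouping clonotypes by abundance amounts to binning the per-cell frequency values. The sketch below shows one way to do this in base R with cut(). The bin edges and labels are hypothetical and are not necessarily the defaults used by calc_frequency().

```r
# Hypothetical per-cell clonotype frequencies (number of cells sharing the clonotype)
freq <- c(1, 1, 2, 4, 12, 250)

# Hypothetical bin edges; intervals are closed on the right, e.g. (1, 5]
grp <- cut(
  freq,
  breaks = c(0, 1, 5, 20, Inf),
  labels = c("1", "2-5", "6-20", ">20")
)

# Number of cells falling into each abundance group
table(grp)
```

Because the result is a factor with ordered levels, it can be mapped directly to a discrete color scale when plotted.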
Highly abundant clonotypes can also be specifically labeled on a UMAP
projection. To do this, add a new meta.data column with the desired
label using the mutate_vdj() function. This function works in a similar
manner to dplyr::mutate(), but specifically modifies the object
meta.data and allows the user to parse per-chain information for each
cell.
top_gg <- so |>
  mutate_vdj(
    top_clonotype = ifelse(clonotype_id == "clonotype907", clonotype_id, "other")
  ) |>
  plot_features(
    feature = "top_clonotype",
    size = 0.25,
    plot_colors = c(other = "#fec44f", clonotype907 = "#3182bd")
  )

mouse_gg + top_gg
Other frequency calculations
In addition to clonotype abundance, calc_frequency() can be used to
summarize the frequency of any cell label present in the object. In
this example we count the number of cells present for each cell type in
each sample.
so_vdj <- so |>
  calc_frequency(
    data_col = "cell_type",
    cluster_col = "sample"
  )
To plot the fraction of cells present for each cell type, we can use
the generic plotting function plot_frequency(). This creates stacked
bar graphs summarizing each cell label present in the data_col column.
The color of each group can be specified with the plot_colors argument.
so |>
  plot_frequency(
    data_col = "cell_type",
    cluster_col = "sample",
    plot_colors = c("#3182bd", "#fec44f", "#31a354")
  )
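The values behind a stacked bar graph like this are per-sample fractions of each label. A minimal base R sketch using hypothetical cell type labels shows the calculation. Apart from "B cells", the labels and all counts are assumptions for illustration.

```r
# Toy meta.data: one row per cell (hypothetical labels)
meta <- data.frame(
  sample    = c("BL6-1", "BL6-1", "BL6-1", "BL6-1", "MD4-1", "MD4-1"),
  cell_type = c("B cells", "B cells", "T cells", "NK cells", "B cells", "T cells")
)

# Cell counts: rows are cell types, columns are samples
counts <- table(meta$cell_type, meta$sample)

# Convert each column to percentages so that every sample sums to 100
pct <- 100 * prop.table(counts, margin = 2)

pct
```

The counts table corresponds to plotting raw cell numbers, while the pct table corresponds to plotting the fraction of cells.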
To summarize the number of cells present for each cell type, set the
units argument to ‘frequency’.
so |>
  plot_frequency(
    data_col = "cell_type",
    cluster_col = "sample",
    units = "frequency",
    stack = FALSE
  )
Frequency plots can also be separated based on an additional grouping
variable such as treatment group (e.g. placebo vs drug) or disease
status (e.g. healthy vs disease) using the group_col argument. This
generates boxplots with each point representing a label present in the
cluster_col column. In this example we have 3 BL6 and 3 MD4 samples, so
there are 3 points shown for each boxplot.
so |>
  plot_frequency(
    data_col = "cell_type",
    cluster_col = "sample",
    group_col = "orig.ident",
    plot_colors = c(MD4 = "#fec44f", BL6 = "#3182bd")
  )
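The relationship between samples, groups, and boxplot points can be sketched in base R. The percentages below are made-up values, and b_cell_pct is a hypothetical column name. The point is only that each group's boxplot is drawn from one value per sample.

```r
# Hypothetical per-sample B cell percentages: three samples per mouse strain
pct <- data.frame(
  sample     = c("BL6-1", "BL6-2", "BL6-3", "MD4-1", "MD4-2", "MD4-3"),
  orig.ident = c("BL6", "BL6", "BL6", "MD4", "MD4", "MD4"),
  b_cell_pct = c(52, 48, 50, 71, 69, 73)
)

# Each group contributes one point per sample to its boxplot
n_points <- table(pct$orig.ident)

# The center line of each boxplot is the median of the per-sample values
medians <- aggregate(b_cell_pct ~ orig.ident, data = pct, FUN = median)

n_points
medians
```

With only three points per group, boxplots summarize very little data, so showing the individual points alongside the boxes is the more informative part of the figure.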