This vignette provides detailed examples for quantifying differences in clonotype frequencies. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.
library(djvdj)
library(Seurat)
library(ggplot2)
# Load GEX data
data_dir <- system.file("extdata/splen", package = "djvdj")
gex_dirs <- c(
BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"),
MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix")
)
so <- gex_dirs |>
Read10X() |>
CreateSeuratObject() |>
AddMetaData(splen_meta)
# Add V(D)J data to object
vdj_dirs <- c(
BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)
so <- so |>
import_vdj(vdj_dirs, define_clonotypes = "cdr3_gene")
Calculating clonotype frequencies
To quantify clonotype frequencies and store the results in the object
meta.data, the calc_frequency()
function can be used. This
will add columns showing the number of occurrences of each clonotype
(‘freq’), the percentage of cells sharing the clonotype (‘pct’), and a
label that can be used for plotting (‘grp’). By default these
calculations will be performed for all cells in the object.
so_vdj <- so |>
calc_frequency(data_col = "clonotype_id")
To calculate clonotype frequencies separately for samples or
clusters, the cluster_col
argument can be used. To do this
just specify the name of the column containing the sample or cluster IDs
for each cell.
so_vdj <- so |>
calc_frequency(
data_col = "clonotype_id",
cluster_col = "sample"
)
When cluster_col
is specified, an additional meta.data
column (‘shared’) will be added indicating whether the clonotype is
shared between multiple clusters.
so_vdj |>
slot("meta.data") |>
head(2)
#> orig.ident nCount_RNA nFeature_RNA RNA_snn_res.1
#> BL6_AAACGGGGTTCTGTTT-1 BL6 202 25 0
#> BL6_AAAGATGCAACAACCT-1 BL6 42 20 3
#> seurat_clusters UMAP_1 UMAP_2
#> BL6_AAACGGGGTTCTGTTT-1 0 -1.7410439 0.8840749
#> BL6_AAAGATGCAACAACCT-1 3 0.9088528 -1.2614110
#> type r cell_type sample
#> BL6_AAACGGGGTTCTGTTT-1 B cells (B.T3) 0.6256432 B cells BL6-1
#> BL6_AAAGATGCAACAACCT-1 B cells (B.T2) 0.6350011 B cells BL6-1
#> exact_subclonotype_id chains n_chains cdr3
#> BL6_AAACGGGGTTCTGTTT-1 NA <NA> NA <NA>
#> BL6_AAAGATGCAACAACCT-1 1 IGK 1 CFQGSHVPWTF
#> cdr3_nt cdr3_length
#> BL6_AAACGGGGTTCTGTTT-1 <NA> <NA>
#> BL6_AAAGATGCAACAACCT-1 TGCTTTCAAGGTTCACATGTTCCGTGGACGTTC 11
#> cdr3_nt_length v_gene d_gene j_gene c_gene
#> BL6_AAACGGGGTTCTGTTT-1 <NA> <NA> <NA> <NA> <NA>
#> BL6_AAAGATGCAACAACCT-1 33 IGKV1-117 None IGKJ1 IGKC
#> isotype reads umis productive full_length paired
#> BL6_AAACGGGGTTCTGTTT-1 <NA> <NA> <NA> <NA> <NA> NA
#> BL6_AAAGATGCAACAACCT-1 None 352 21 TRUE TRUE FALSE
#> clonotype_id n_cells clonotype_id_freq
#> BL6_AAACGGGGTTCTGTTT-1 <NA> NA NA
#> BL6_AAAGATGCAACAACCT-1 clonotype34 55 1
#> clonotype_id_pct clonotype_id_shared
#> BL6_AAACGGGGTTCTGTTT-1 NA NA
#> BL6_AAAGATGCAACAACCT-1 1.818182 TRUE
#> clonotype_id_grp
#> BL6_AAACGGGGTTCTGTTT-1 <NA>
#> BL6_AAAGATGCAACAACCT-1 1
Plotting clonotype frequencies
djvdj includes the plot_clone_frequency()
function to
visualize differences in clonotype frequency between samples or
clusters. By default this will produce bargraphs. Plot colors can be
adjusted using the plot_colors
argument.
so |>
plot_clone_frequency(
data_col = "clonotype_id",
plot_colors = "#3182bd"
)
Frequencies can be calculated and plotted separately for each sample
or cluster using the cluster_col
argument. The
panel_nrow
and panel_scales
arguments can be
used to add separate scales for each sample or to adjust the number of
rows used to arrange plots.
As expected we see that most MD4 B cells share the same clonotype, while BL6 cells have a diverse repertoire.
so |>
plot_clone_frequency(
data_col = "clonotype_id",
cluster_col = "orig.ident",
panel_scales = "free"
)
Rank-abundance plots can also be generated by setting the
method
argument to ‘line’. Most djvdj plotting functions
return ggplot objects that can be further modified with ggplot2
functions. Here we further modify plot aesthetics using the
ggplot::theme()
function. Most djvdj plotting function also
include the ability to transform the axis using the trans
argument.
so |>
plot_clone_frequency(
data_col = "clonotype_id",
cluster_col = "orig.ident",
method = "line",
plot_colors = c(MD4 = "#fec44f", BL6 = "#3182bd"),
trans = "log10" # log-transform axis
) +
theme(aspect.ratio = 0.8)
UMAP projections
By default calc_frequency()
will divide clonotypes into
groups based on frequency and add a column to the meta.data containing
these group labels. Clonotype frequencies can be summarized on a UMAP
projection by plotting the added ‘grp’ column using the generic plotting
function plot_scatter()
.
Cells that lack BCR data will be plotted as NA
s, the
color of these points can be adjusted using the na_color
argument.
# Create UMAP summarizing samples
mouse_gg <- so |>
plot_scatter(data_col = "orig.ident")
# Create UMAP summarizing clonotype frequencies
abun_gg <- so |>
calc_frequency(
data_col = "clonotype_id",
cluster_col = "sample"
) |>
plot_scatter(data_col = "clonotype_id_grp")
mouse_gg + abun_gg
Highly abundant clonotypes can also be specifically labeled on a UMAP
projection. To do this, pass a vector of top clonotypes to highlight to
the top
argument of plot_scatter()
.
top_gg <- so |>
plot_scatter(
data_col = "clonotype_id",
top = "clonotype56",
plot_colors = c(other = "#fec44f", clonotype56 = "#3182bd")
)
mouse_gg + top_gg
Other frequency calculations
In addition to clonotype abundance, calc_frequency()
can
be used to summarize the frequency of any cell label present in the
object. In this example we count the number of cells present for each
cell type in each sample.
so_vdj <- so |>
calc_frequency(
data_col = "cell_type",
cluster_col = "sample"
)
To plot the fraction of cells present for each cell type, we can use
the generic plotting function, plot_frequency()
. This will
create stacked bargraphs summarizing each cell label present in the
data_col
column. The color of each group can be specified
with the plot_colors
argument.
so |>
plot_frequency(
data_col = "cell_type",
cluster_col = "sample",
plot_colors = c("#3182bd", "#fec44f")
)
To summarize the number cells present for each cell type, set the
units
argument to ‘frequency’. To create grouped bargraphs,
set the stack
argument to FALSE
.
so |>
plot_frequency(
data_col = "cell_type",
cluster_col = "sample",
units = "frequency",
stack = FALSE
)
Clusters can also be grouped based on an additional variable such as
treatment group (e.g. placebo vs drug) or disease status (e.g. healthy
vs disease). This will generate bargraphs (or boxplots) showing the mean
and standard deviation for each group. In this example we are comparing
the 3 BL6 and 3 MD4 samples. You will also notice that there is a group
labeled as NA
, these are cells that lacked V(D)J data and
thus did not have an assigned isotype.
so |>
plot_frequency(
data_col = "isotype",
cluster_col = "sample",
group_col = "orig.ident",
plot_colors = c(MD4 = "#fec44f", BL6 = "#3182bd")
)
p-values
p-values can be calculated and shown on plots generated by
plot_frequency()
and plot_gene_usage()
. To do
this, you must pass a grouping variable to the group_col
argument, which is used to group the clusters found in
cluster_col
. This is best used when you have a set of
samples that can be divided into distinct groups. The cluster names
should be unique for each treatment group, e.g. healthy: healthy-1,
healthy-2; disease: disease-1, disease-2.
The method used to calculate p-values can be specified with the
p_method
argument. By default a t-test will be performed,
if more than two groups are compared the Kruskal-Wallis test will be
used. A summary table of the calculated p-values can also be saved by
passing a path to the p_file
argument.
The p_label
argument can be used to modify which
p-values are shown on the plot, by default only significant p-values are
shown. In this example we display all p-values calculated using a
t-test.
so |>
plot_frequency(
data_col = "isotype",
cluster_col = "sample",
group_col = "orig.ident",
p_label = "all",
p_method = "t"
)
Custom labels for different p-value cutoffs can be specified by
passing a named vector to the p_label
argument. To display
the actual p-value when it is below a certain threshold, use the keyword
‘value’. Symbols can also be displayed by including the unicode symbol
code. In this example we display p-values <0.05, print a soccer ball
for <0.1, and all others are labeled as ‘ns’.
so |>
plot_frequency(
data_col = "isotype",
cluster_col = "sample",
group_col = "orig.ident",
p_label = c(value = 0.05, "\\u26BD" = 0.1, ns = Inf)
)
Label aesthetics can be modified by passing a named list of aesthetic
parameters to the label_params
argument. These parameters
will also modify the n-label, to specifically modify the p-value label,
prefix each parameter with ‘p.’, e.g. p.size = 14
.
so |>
plot_frequency(
data_col = "isotype",
cluster_col = "sample",
group_col = "orig.ident",
n_label = "corner",
label_params = list(p.color = "red")
)
Session info
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] ggplot2_3.4.4 SeuratObject_4.1.4 Seurat_4.4.0
#> [4] djvdj_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.7 magrittr_2.0.3
#> [4] spatstat.utils_3.0-3 farver_2.1.1 rmarkdown_2.25
#> [7] fs_1.6.3 ragg_1.2.6 vctrs_0.6.4
#> [10] ROCR_1.0-11 memoise_2.0.1 spatstat.explore_3.2-5
#> [13] htmltools_0.5.6.1 sass_0.4.7 sctransform_0.4.1
#> [16] parallelly_1.36.0 KernSmooth_2.23-21 bslib_0.5.1
#> [19] htmlwidgets_1.6.2 desc_1.4.2 ica_1.0-3
#> [22] plyr_1.8.9 plotly_4.10.3 zoo_1.8-12
#> [25] cachem_1.0.8 igraph_1.5.1 mime_0.12
#> [28] lifecycle_1.0.3 pkgconfig_2.0.3 Matrix_1.6-1.1
#> [31] R6_2.5.1 fastmap_1.1.1 fitdistrplus_1.1-11
#> [34] future_1.33.0 shiny_1.7.5.1 digest_0.6.33
#> [37] colorspace_2.1-0 patchwork_1.1.3 rprojroot_2.0.3
#> [40] tensor_1.5 irlba_2.3.5.1 textshaping_0.3.7
#> [43] labeling_0.4.3 progressr_0.14.0 fansi_1.0.5
#> [46] spatstat.sparse_3.0-2 httr_1.4.7 polyclip_1.10-6
#> [49] abind_1.4-5 compiler_4.3.1 bit64_4.0.5
#> [52] withr_2.5.1 MASS_7.3-60 tools_4.3.1
#> [55] lmtest_0.9-40 httpuv_1.6.12 future.apply_1.11.0
#> [58] goftest_1.2-3 glue_1.6.2 nlme_3.1-162
#> [61] promises_1.2.1 grid_4.3.1 Rtsne_0.16
#> [64] cluster_2.1.4 reshape2_1.4.4 generics_0.1.3
#> [67] gtable_0.3.4 spatstat.data_3.0-1 tzdb_0.4.0
#> [70] tidyr_1.3.0 data.table_1.14.8 hms_1.1.3
#> [73] sp_2.1-1 utf8_1.2.4 spatstat.geom_3.2-7
#> [76] RcppAnnoy_0.0.21 ggrepel_0.9.4 RANN_2.6.1
#> [79] pillar_1.9.0 stringr_1.5.0 vroom_1.6.4
#> [82] later_1.3.1 splines_4.3.1 dplyr_1.1.3
#> [85] lattice_0.21-8 bit_4.0.5 survival_3.5-5
#> [88] deldir_1.0-9 tidyselect_1.2.0 miniUI_0.1.1.1
#> [91] pbapply_1.7-2 knitr_1.44 gridExtra_2.3
#> [94] scattermore_1.2 xfun_0.40 matrixStats_1.0.0
#> [97] stringi_1.7.12 lazyeval_0.2.2 yaml_2.3.7
#> [100] evaluate_0.22 codetools_0.2-19 tibble_3.2.1
#> [103] cli_3.6.1 uwot_0.1.16 xtable_1.8-4
#> [106] reticulate_1.34.0 systemfonts_1.0.5 munsell_0.5.0
#> [109] jquerylib_0.1.4 Rcpp_1.0.11 globals_0.16.2
#> [112] spatstat.random_3.2-1 png_0.1-8 parallel_4.3.1
#> [115] ellipsis_0.3.2 pkgdown_2.0.7 readr_2.1.4
#> [118] listenv_0.9.0 viridisLite_0.4.2 scales_1.2.1
#> [121] ggridges_0.5.4 crayon_1.5.2 leiden_0.4.3
#> [124] purrr_1.0.2 rlang_1.1.1 cowplot_1.1.1