This vignette provides detailed examples for calculating and visualizing V(D)J gene usage. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.
library(djvdj)
library(Seurat)
library(ggplot2)
library(RColorBrewer)
# Load GEX data
data_dir <- system.file("extdata/splen", package = "djvdj")
gex_dirs <- c(
BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"),
MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix")
)
so <- gex_dirs |>
Read10X() |>
CreateSeuratObject() |>
AddMetaData(splen_meta)
# Add V(D)J data to object
vdj_dirs <- c(
BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)
so <- so |>
import_vdj(vdj_dirs, define_clonotypes = "cdr3_gene")
Calculating gene usage
The calc_gene_usage()
function will calculate the number
of cells (‘freq’) and percentage of cells (‘pct’) with each gene in the
data_cols
column(s). The ‘n_cells’ column shows the total
number of cells used for calculating percentages. By default these
results are added to the object meta.data, to return a data.frame set
return_df
to TRUE
.
so |>
calc_gene_usage(
data_cols = "v_gene",
return_df = TRUE
)
#> # A tibble: 103 × 4
#> v_gene n_cells freq pct
#> <chr> <int> <int> <dbl>
#> 1 IGKV5-43 329 156 47.4
#> 2 IGKV10-96 329 14 4.26
#> 3 IGKV6-17 329 9 2.74
#> 4 IGLV1 329 9 2.74
#> 5 IGKV14-111 329 8 2.43
#> 6 IGKV1-117 329 7 2.13
#> 7 IGKV13-84 329 7 2.13
#> 8 IGKV1-135 329 6 1.82
#> 9 IGKV19-93 329 6 1.82
#> 10 IGKV3-4 329 6 1.82
#> # ℹ 93 more rows
To perform gene usage calculations separately for cell clusters (or
samples), provide the meta.data column containing cluster labels to the
cluster_col
argument. Here we see that the MD4 samples
almost exclusively use a single V segment (IGKV5-43), which is expected
since MD4 B cells are monoclonal.
so |>
calc_gene_usage(
data_cols = "v_gene",
cluster_col = "sample",
return_df = TRUE
)
#> # A tibble: 618 × 6
#> v_gene sample n_cells freq pct shared
#> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 IGKV5-43 MD4-2 55 55 100 TRUE
#> 2 IGKV5-43 MD4-3 54 51 94.4 TRUE
#> 3 IGKV5-43 MD4-1 50 47 94 TRUE
#> 4 IGKV10-96 BL6-1 55 6 10.9 TRUE
#> 5 IGKV10-96 BL6-3 52 5 9.62 TRUE
#> 6 IGKV14-111 BL6-2 63 6 9.52 TRUE
#> 7 IGLV1 BL6-3 52 4 7.69 TRUE
#> 8 IGHV6-6 BL6-1 55 4 7.27 FALSE
#> 9 IGKV6-17 BL6-2 63 4 6.35 TRUE
#> 10 IGKV8-24 BL6-2 63 4 6.35 FALSE
#> # ℹ 608 more rows
To only perform calculations for a specific chain, use the
chain
argument. In this example we are only returning
results for the IGK chain. Here we see some values in the ‘v_gene’
column labeled as ‘None’, this shows the number of cells that did not
have a V gene segment identified.
so |>
calc_gene_usage(
data_cols = "v_gene",
cluster_col = "sample",
chain = "IGK",
return_df = TRUE
)
#> # A tibble: 336 × 6
#> v_gene sample n_cells freq pct shared
#> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 IGKV5-43 MD4-2 55 55 100 TRUE
#> 2 IGKV5-43 MD4-3 54 51 94.4 TRUE
#> 3 IGKV5-43 MD4-1 50 47 94 TRUE
#> 4 IGKV10-96 BL6-1 55 6 10.9 TRUE
#> 5 IGKV10-96 BL6-3 52 5 9.62 TRUE
#> 6 None BL6-3 52 5 9.62 TRUE
#> 7 IGKV14-111 BL6-2 63 6 9.52 TRUE
#> 8 IGKV6-17 BL6-2 63 4 6.35 TRUE
#> 9 IGKV8-24 BL6-2 63 4 6.35 FALSE
#> 10 IGKV3-4 BL6-3 52 3 5.77 TRUE
#> # ℹ 326 more rows
If two columns are provided to the data_cols
argument,
the number of cells containing each combination of genes is
returned.
so |>
calc_gene_usage(
data_cols = c("v_gene", "j_gene"),
cluster_col = "sample",
return_df = TRUE
)
#> # A tibble: 6,798 × 7
#> v_gene j_gene sample n_cells freq pct shared
#> <chr> <chr> <chr> <dbl> <int> <dbl> <lgl>
#> 1 IGKV5-43 IGKJ2 MD4-2 55 55 100 TRUE
#> 2 IGKV5-43 IGKJ2 MD4-3 54 51 94.4 TRUE
#> 3 IGKV5-43 IGKJ2 MD4-1 50 47 94 TRUE
#> 4 IGKV14-111 IGKJ2 BL6-2 63 4 6.35 FALSE
#> 5 IGKV10-96 IGKJ1 BL6-3 52 3 5.77 TRUE
#> 6 IGLV1 IGLJ1 BL6-3 52 3 5.77 TRUE
#> 7 IGKV10-96 IGKJ2 BL6-1 55 3 5.45 TRUE
#> 8 IGKV10-96 IGKJ1 BL6-2 63 3 4.76 TRUE
#> 9 IGKV1-135 IGKJ1 BL6-3 52 2 3.85 TRUE
#> 10 IGKV10-96 IGKJ2 BL6-3 52 2 3.85 TRUE
#> # ℹ 6,788 more rows
Plotting gene usage
The plot_gene_usage()
function will summarize the
frequency of each gene segment. By default if a single column is passed
to the data_cols
argument, a bargraph will be returned. The
number of top genes to include in the plot can be specified with the
genes
argument.
so |>
plot_gene_usage(
data_cols = "v_gene",
cluster_col = "orig.ident",
genes = 20
)
By default, percentages are shown on the y-axis, to instead plot the
frequency, set the units
argument to ‘frequency’.
so |>
plot_gene_usage(
data_cols = "v_gene",
cluster_col = "orig.ident",
units = "frequency"
)
Plot colors can be adjusted using the plot_colors
argument. In addition, plot_gene_usage()
returns a ggplot
object that can be modified with ggplot2 functions such as
ggplot2::theme()
. Plots can be further adjusted by passing
aesthetic parameters directly to ggplot2, e.g. alpha
,
linetype
, color
, etc.
so |>
plot_gene_usage(
data_cols = "v_gene",
cluster_col = "orig.ident",
plot_colors = c(BL6 = "#3288BD", MD4 = "#D53E4F"),
color = "black", # parameters to pass to ggplot2
alpha = 0.7
) +
theme(axis.text.x = element_text(angle = 90))
If two columns are passed to the data_cols
argument, a
heatmap will be generated summarizing the usage of different pairs of
segments. If a column is provided to the cluster_col
argument, a separate heatmap will be generated for each cluster.
In this example we are plotting the frequency that different heavy chain V and J segments appear together.
so |>
plot_gene_usage(
data_cols = c("v_gene", "j_gene"),
cluster_col = "orig.ident",
chain = "IGH",
genes = 15
)
The paired gene usage for two chains can also be plotted using
plot_gene_pairs()
. In this example we are plotting the
frequency that different heavy and light chain V segments appear
together.
so |>
plot_gene_pairs(
data_col = "v_gene",
chains = c("IGH", "IGK"),
cluster_col = "orig.ident",
genes = 12
)
Circos plot
A circos plot can be created by setting the method
argument to ‘circos’. This plot will summarize the number of cells
containing different gene pairs, which is shown as the axis labels for
each sample. This requires the circlize package to be installed.
In this example, we are summarizing the segment usage for the entire
dataset (BL6 and MD4 cells combined). The cluster_col
argument can be used to create a separate plot for each sample. Labels
can be rotated to eliminate overlapping text using the
rotate_labels
argument.
so |>
plot_gene_usage(
data_cols = c("v_gene", "j_gene"),
method = "circos",
genes = 6,
rotate_labels = TRUE
)
Plot colors can be modified using the plot_colors
argument, additional parameters can be passed directly to
circlize::chordDiagram()
. In this example we add a border
around the links and scale the plot so each sample is the same
width.
so |>
plot_gene_usage(
data_cols = c("v_gene", "j_gene"),
method = "circos",
genes = 8,
plot_colors = brewer.pal(10, "Spectral"),
link.border = "black", # parameters to pass to chordDiagram()
scale = TRUE
)
Gene segment usage can be plotted separately for cell clusters (or
samples) using the cluster_col
argument. The number of rows
used to arrange plots can be modified using the panel_nrow
argument.
so |>
plot_gene_usage(
data_cols = c("v_gene", "j_gene"),
method = "circos",
cluster_col = "sample",
genes = 5,
panel_nrow = 2,
scale = TRUE
)
Session info
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] RColorBrewer_1.1-3 ggplot2_3.4.4 SeuratObject_4.1.4
#> [4] Seurat_4.4.0 djvdj_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] shape_1.4.6 jsonlite_1.8.7 magrittr_2.0.3
#> [4] spatstat.utils_3.0-3 farver_2.1.1 rmarkdown_2.25
#> [7] GlobalOptions_0.1.2 fs_1.6.3 ragg_1.2.6
#> [10] vctrs_0.6.4 ROCR_1.0-11 memoise_2.0.1
#> [13] spatstat.explore_3.2-5 htmltools_0.5.6.1 sass_0.4.7
#> [16] sctransform_0.4.1 parallelly_1.36.0 KernSmooth_2.23-21
#> [19] bslib_0.5.1 htmlwidgets_1.6.2 desc_1.4.2
#> [22] ica_1.0-3 plyr_1.8.9 plotly_4.10.3
#> [25] zoo_1.8-12 cachem_1.0.8 igraph_1.5.1
#> [28] mime_0.12 lifecycle_1.0.3 pkgconfig_2.0.3
#> [31] Matrix_1.6-1.1 R6_2.5.1 fastmap_1.1.1
#> [34] fitdistrplus_1.1-11 future_1.33.0 shiny_1.7.5.1
#> [37] digest_0.6.33 colorspace_2.1-0 patchwork_1.1.3
#> [40] rprojroot_2.0.3 tensor_1.5 irlba_2.3.5.1
#> [43] textshaping_0.3.7 labeling_0.4.3 progressr_0.14.0
#> [46] fansi_1.0.5 spatstat.sparse_3.0-2 httr_1.4.7
#> [49] polyclip_1.10-6 abind_1.4-5 compiler_4.3.1
#> [52] bit64_4.0.5 withr_2.5.1 MASS_7.3-60
#> [55] tools_4.3.1 lmtest_0.9-40 httpuv_1.6.12
#> [58] future.apply_1.11.0 goftest_1.2-3 glue_1.6.2
#> [61] nlme_3.1-162 promises_1.2.1 grid_4.3.1
#> [64] Rtsne_0.16 cluster_2.1.4 reshape2_1.4.4
#> [67] generics_0.1.3 gtable_0.3.4 spatstat.data_3.0-1
#> [70] tzdb_0.4.0 tidyr_1.3.0 data.table_1.14.8
#> [73] hms_1.1.3 sp_2.1-1 utf8_1.2.4
#> [76] spatstat.geom_3.2-7 RcppAnnoy_0.0.21 ggrepel_0.9.4
#> [79] RANN_2.6.1 pillar_1.9.0 stringr_1.5.0
#> [82] vroom_1.6.4 later_1.3.1 circlize_0.4.15
#> [85] splines_4.3.1 dplyr_1.1.3 lattice_0.21-8
#> [88] bit_4.0.5 survival_3.5-5 deldir_1.0-9
#> [91] tidyselect_1.2.0 miniUI_0.1.1.1 pbapply_1.7-2
#> [94] knitr_1.44 gridExtra_2.3 scattermore_1.2
#> [97] xfun_0.40 matrixStats_1.0.0 stringi_1.7.12
#> [100] lazyeval_0.2.2 yaml_2.3.7 evaluate_0.22
#> [103] codetools_0.2-19 tibble_3.2.1 cli_3.6.1
#> [106] uwot_0.1.16 xtable_1.8-4 reticulate_1.34.0
#> [109] systemfonts_1.0.5 munsell_0.5.0 jquerylib_0.1.4
#> [112] Rcpp_1.0.11 globals_0.16.2 spatstat.random_3.2-1
#> [115] png_0.1-8 parallel_4.3.1 ellipsis_0.3.2
#> [118] pkgdown_2.0.7 readr_2.1.4 listenv_0.9.0
#> [121] viridisLite_0.4.2 scales_1.2.1 ggridges_0.5.4
#> [124] crayon_1.5.2 leiden_0.4.3 purrr_1.0.2
#> [127] rlang_1.1.1 cowplot_1.1.1