This vignette provides detailed examples for manipulating V(D)J data imported into a single-cell object using djvdj. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.
library(djvdj)
library(Seurat)
library(ggplot2)
library(dplyr)
# Load GEX data
data_dir <- system.file("extdata/splen", package = "djvdj")
gex_dirs <- c(
BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"),
MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix")
)
so <- gex_dirs |>
Read10X() |>
CreateSeuratObject() |>
AddMetaData(splen_meta)
# Add V(D)J data to object
vdj_dirs <- c(
BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)
so <- so |>
import_vdj(
vdj_dirs,
define_clonotypes = "cdr3_gene",
include_mutations = TRUE
)
Filtering V(D)J data
Per-chain data can be filtered with the filter_vdj()
function. The per-chain values for each cell are parsed based on the
;
separator and converted to a vector. This allows vector
operations to be used for filtering. This function will not remove cells
from the object, but instead remove the V(D)J data for cells that do not
match the provided filtering expression.
In this example we are only including V(D)J data for cells that have all of the chains, IGH, IGK, and IGL.
res <- so |>
filter_vdj(
all(c("IGH", "IGK", "IGL") %in% chains)
)
res |>
slot("meta.data") |>
filter(!is.na(clonotype_id)) |>
select(chains, cdr3) |>
head()
#> chains
#> BL6_CCACTACTCTGCAAGT-1 IGH;IGK;IGL
#> cdr3
#> BL6_CCACTACTCTGCAAGT-1 CARGDSSGYVAMDYW;CLQSDNMPYTF;CALWYSNHFIF
In this example we are removing V(D)J data for all chains except IGH.
res <- so |>
filter_vdj(chains == "IGH")
res |>
slot("meta.data") |>
filter(!is.na(clonotype_id)) |>
select(chains, cdr3) |>
head(3)
#> chains cdr3
#> BL6_ACACCAAAGAATTGTG-1 IGH CAHGSRDFDVW
#> BL6_ACACCGGCACAAGTAA-1 IGH CARHEGYYEAMDYW
#> BL6_ACAGCTATCTGCCCTA-1 IGH CARLLLRWMDYW
Summarizing per-chain data
The summarize_vdj()
function can be used to summarize
the per-chain data for each cell and add the results to the meta.data.
In this example we are calculating the median number of deletions and
the median number of insertions for each cell. The
col_names
argument can be used to name the new columns, use
‘{.col}’ to refer to the original column name.
res <- so |>
summarize_vdj(
data_cols = c("all_ins", "all_del"),
fn = stats::median,
col_names = "median_{.col}"
)
res |>
slot("meta.data") |>
filter(n_chains > 1) |>
select(matches("all_(del|ins)")) |>
head(2)
#> all_ins all_del median_all_ins median_all_del
#> BL6_ACACCAAAGAATTGTG-1 2;0 0;3 1 1.5
#> BL6_ACACCGGCACAAGTAA-1 0;0 0;6 0 3.0
This function can also be used for character strings such as the chains column. In this example we are creating a new column in the meta.data containing the unique chains for each cell.
res <- so |>
summarize_vdj(
data_cols = "chains",
fn = ~ paste0(unique(.x), collapse = "_"),
col_names = "unique_chains"
)
res |>
slot("meta.data") |>
filter(n_chains > 2) |>
select(chains, unique_chains) |>
head(2)
#> chains unique_chains
#> BL6_AGTGGGACATTTCACT-1 IGH;IGK;IGK IGH_IGK
#> BL6_AGTGGGAGTCAGATAA-1 IGH;IGK;IGK IGH_IGK
Mutating per-chain data
Another way to modify V(D)J data present in the object is with the
mutate_vdj()
function. The function behaves in a similar
way as dplyr::mutate()
, but will parse the per-chain values
for each cell and convert them to a vector. This allows vector
operations to be performed when modifying the meta.data.
In this example we are calculating the sum of all insertions and deletions for each cell and storing this information in a new column called ‘total_indels’.
res <- so |>
mutate_vdj(
total_indels = sum(all_ins, all_del)
)
res |>
slot("meta.data") |>
select(all_ins, all_del, total_indels) |>
head()
#> all_ins all_del total_indels
#> BL6_AAACGGGGTTCTGTTT-1 <NA> <NA> NA
#> BL6_AAAGATGCAACAACCT-1 0 3 3
#> BL6_AACGTTGCACGACTCG-1 0 6 6
#> BL6_AACTTTCTCGCCTGTT-1 0 5 5
#> BL6_AAGGTTCTCAGTTGAC-1 <NA> <NA> NA
#> BL6_AAGTCTGAGTACGACG-1 0 8 8
Session info
#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so; LAPACK version 3.10.0
#>
#> locale:
#> [1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C LC_TIME=C.UTF-8
#> [4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
#> [7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
#> [10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
#>
#> time zone: UTC
#> tzcode source: system (glibc)
#>
#> attached base packages:
#> [1] stats graphics grDevices utils datasets methods base
#>
#> other attached packages:
#> [1] dplyr_1.1.3 ggplot2_3.4.4 SeuratObject_4.1.4
#> [4] Seurat_4.4.0 djvdj_0.1.0
#>
#> loaded via a namespace (and not attached):
#> [1] RColorBrewer_1.1-3 jsonlite_1.8.7
#> [3] magrittr_2.0.3 spatstat.utils_3.0-3
#> [5] rmarkdown_2.25 zlibbioc_1.46.0
#> [7] fs_1.6.3 ragg_1.2.6
#> [9] vctrs_0.6.4 ROCR_1.0-11
#> [11] Rsamtools_2.16.0 memoise_2.0.1
#> [13] spatstat.explore_3.2-5 RCurl_1.98-1.12
#> [15] htmltools_0.5.6.1 sass_0.4.7
#> [17] sctransform_0.4.1 parallelly_1.36.0
#> [19] KernSmooth_2.23-21 bslib_0.5.1
#> [21] htmlwidgets_1.6.2 desc_1.4.2
#> [23] ica_1.0-3 plyr_1.8.9
#> [25] plotly_4.10.3 zoo_1.8-12
#> [27] cachem_1.0.8 igraph_1.5.1
#> [29] mime_0.12 lifecycle_1.0.3
#> [31] pkgconfig_2.0.3 Matrix_1.6-1.1
#> [33] R6_2.5.1 fastmap_1.1.1
#> [35] GenomeInfoDbData_1.2.10 fitdistrplus_1.1-11
#> [37] future_1.33.0 shiny_1.7.5.1
#> [39] digest_0.6.33 colorspace_2.1-0
#> [41] S4Vectors_0.38.2 patchwork_1.1.3
#> [43] rprojroot_2.0.3 tensor_1.5
#> [45] irlba_2.3.5.1 GenomicRanges_1.52.1
#> [47] textshaping_0.3.7 progressr_0.14.0
#> [49] fansi_1.0.5 spatstat.sparse_3.0-2
#> [51] httr_1.4.7 polyclip_1.10-6
#> [53] abind_1.4-5 compiler_4.3.1
#> [55] bit64_4.0.5 withr_2.5.1
#> [57] BiocParallel_1.34.2 MASS_7.3-60
#> [59] tools_4.3.1 lmtest_0.9-40
#> [61] httpuv_1.6.12 future.apply_1.11.0
#> [63] goftest_1.2-3 glue_1.6.2
#> [65] nlme_3.1-162 promises_1.2.1
#> [67] grid_4.3.1 Rtsne_0.16
#> [69] cluster_2.1.4 reshape2_1.4.4
#> [71] generics_0.1.3 gtable_0.3.4
#> [73] spatstat.data_3.0-1 tzdb_0.4.0
#> [75] tidyr_1.3.0 data.table_1.14.8
#> [77] hms_1.1.3 XVector_0.40.0
#> [79] sp_2.1-1 utf8_1.2.4
#> [81] BiocGenerics_0.46.0 spatstat.geom_3.2-7
#> [83] RcppAnnoy_0.0.21 ggrepel_0.9.4
#> [85] RANN_2.6.1 pillar_1.9.0
#> [87] stringr_1.5.0 vroom_1.6.4
#> [89] later_1.3.1 splines_4.3.1
#> [91] lattice_0.21-8 bit_4.0.5
#> [93] survival_3.5-5 deldir_1.0-9
#> [95] tidyselect_1.2.0 Biostrings_2.68.1
#> [97] miniUI_0.1.1.1 pbapply_1.7-2
#> [99] knitr_1.44 gridExtra_2.3
#> [101] IRanges_2.34.1 scattermore_1.2
#> [103] stats4_4.3.1 xfun_0.40
#> [105] matrixStats_1.0.0 stringi_1.7.12
#> [107] lazyeval_0.2.2 yaml_2.3.7
#> [109] evaluate_0.22 codetools_0.2-19
#> [111] tibble_3.2.1 cli_3.6.1
#> [113] uwot_0.1.16 xtable_1.8-4
#> [115] reticulate_1.34.0 systemfonts_1.0.5
#> [117] munsell_0.5.0 jquerylib_0.1.4
#> [119] GenomeInfoDb_1.36.4 Rcpp_1.0.11
#> [121] globals_0.16.2 spatstat.random_3.2-1
#> [123] png_0.1-8 parallel_4.3.1
#> [125] ellipsis_0.3.2 pkgdown_2.0.7
#> [127] readr_2.1.4 bitops_1.0-7
#> [129] listenv_0.9.0 viridisLite_0.4.2
#> [131] scales_1.2.1 ggridges_0.5.4
#> [133] crayon_1.5.2 leiden_0.4.3
#> [135] purrr_1.0.2 rlang_1.1.1
#> [137] cowplot_1.1.1