Data Manipulation

This vignette provides detailed examples for manipulating V(D)J data imported into a single-cell object using djvdj. For the examples shown below, we use data for splenocytes from BL6 and MD4 mice collected using the 10X Genomics scRNA-seq platform. MD4 B cells are monoclonal and specifically bind hen egg lysozyme.

library(djvdj)
library(Seurat)
library(ggplot2)
library(dplyr)

# Load GEX data
data_dir <- system.file("extdata/splen", package = "djvdj")

gex_dirs <- c(
  BL6 = file.path(data_dir, "BL6_GEX/filtered_feature_bc_matrix"),
  MD4 = file.path(data_dir, "MD4_GEX/filtered_feature_bc_matrix")
)

so <- gex_dirs |>
  Read10X() |>
  CreateSeuratObject() |>
  AddMetaData(splen_meta)

# Add V(D)J data to object
vdj_dirs <- c(
  BL6 = system.file("extdata/splen/BL6_BCR", package = "djvdj"),
  MD4 = system.file("extdata/splen/MD4_BCR", package = "djvdj")
)

so <- so |>
  import_vdj(
    vdj_dirs,
    define_clonotypes = "cdr3_gene",
    include_mutations = TRUE
  )

Filtering V(D)J data

Per-chain data can be filtered with the filter_vdj() function. The per-chain values for each cell are split on the ; separator and converted to a vector, which allows vector operations to be used in the filtering expression. This function does not remove cells from the object; instead, it removes the V(D)J data that does not match the provided filtering expression.

In this example we keep V(D)J data only for cells that have all three chains: IGH, IGK, and IGL.

res <- so |>
  filter_vdj(
    all(c("IGH", "IGK", "IGL") %in% chains)
  )

res |>
  slot("meta.data") |>
  filter(!is.na(clonotype_id)) |>
  select(chains, cdr3) |>
  head()
#>                             chains
#> BL6_CCACTACTCTGCAAGT-1 IGH;IGK;IGL
#>                                                           cdr3
#> BL6_CCACTACTCTGCAAGT-1 CARGDSSGYVAMDYW;CLQSDNMPYTF;CALWYSNHFIF

In this example we are removing V(D)J data for all chains except IGH.

res <- so |>
  filter_vdj(chains == "IGH")

res |>
  slot("meta.data") |>
  filter(!is.na(clonotype_id)) |>
  select(chains, cdr3) |>
  head(3)
#>                        chains           cdr3
#> BL6_ACACCAAAGAATTGTG-1    IGH    CAHGSRDFDVW
#> BL6_ACACCGGCACAAGTAA-1    IGH CARHEGYYEAMDYW
#> BL6_ACAGCTATCTGCCCTA-1    IGH   CARLLLRWMDYW
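
The two examples above differ in what the filtering expression returns: one TRUE/FALSE for the whole cell, or one TRUE/FALSE per chain. The base R sketch below illustrates that idea on a toy data frame. It is only an approximation for illustration, not how djvdj is implemented; the helper `filter_chains()` and the toy values are hypothetical.

```r
# Toy per-cell metadata; values are illustrative, not taken from the dataset
meta <- data.frame(
  chains = c("IGH;IGK", "IGK", "IGH;IGK;IGL"),
  cdr3   = c(
    "CAHGSRDFDVW;CLQSDNMPYTF",
    "CQQWSSNPLTF",
    "CARGDSSGYVAMDYW;CLQSDNMPYTF;CALWYSNHFIF"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split each cell's values on `sep`, evaluate `keep_fn`
# on the vector of chains, and keep only the chains where it is TRUE
filter_chains <- function(df, keep_fn, sep = ";") {
  chains <- strsplit(df$chains, sep, fixed = TRUE)
  cdr3   <- strsplit(df$cdr3, sep, fixed = TRUE)

  for (i in seq_len(nrow(df))) {
    keep <- keep_fn(chains[[i]])

    # A single TRUE/FALSE applies to every chain in the cell
    keep <- rep_len(keep, length(chains[[i]]))

    if (any(keep)) {
      df$chains[i] <- paste(chains[[i]][keep], collapse = sep)
      df$cdr3[i]   <- paste(cdr3[[i]][keep], collapse = sep)
    } else {
      # The cell stays in the table, but its V(D)J data is set to NA
      df$chains[i] <- NA_character_
      df$cdr3[i]   <- NA_character_
    }
  }

  df
}

# Per-chain condition: keep only IGH chains
heavy_only <- filter_chains(meta, function(x) x == "IGH")

# Per-cell condition: keep cells that have all three chains
all_three <- filter_chains(meta, function(x) all(c("IGH", "IGK", "IGL") %in% x))
```

In both cases the number of rows is unchanged; only the per-chain strings are trimmed or replaced with NA.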

Summarizing per-chain data

The summarize_vdj() function can be used to summarize the per-chain data for each cell and add the results to the meta.data. In this example we are calculating the median number of insertions and the median number of deletions for each cell. The col_names argument can be used to name the new columns; use ‘{.col}’ to refer to the original column name.

res <- so |>
  summarize_vdj(
    data_cols = c("all_ins", "all_del"),
    fn        = stats::median,
    col_names = "median_{.col}"
  )

res |>
  slot("meta.data") |>
  filter(n_chains > 1) |>
  select(matches("all_(del|ins)")) |>
  head(2)
#>                        all_ins all_del median_all_ins median_all_del
#> BL6_ACACCAAAGAATTGTG-1     2;0     0;3              1            1.5
#> BL6_ACACCGGCACAAGTAA-1     0;0     0;6              0            3.0
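
To make the per-cell summary and the ‘{.col}’ naming pattern concrete, here is a rough base R sketch on a toy data frame. It mimics the behavior shown above under simplifying assumptions and is not djvdj's implementation; `summarize_cols()` and its arguments are hypothetical names.

```r
# Toy metadata with ";"-separated per-chain values (illustrative only)
meta <- data.frame(
  all_ins = c("2;0", "0;0", "0"),
  all_del = c("0;3", "0;6", "3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: apply `fn` to the per-chain values of each cell and
# store the result in a column named by substituting {.col} in the template
summarize_cols <- function(df, cols, fn, name_template, sep = ";") {
  for (col in cols) {
    per_chain <- lapply(strsplit(df[[col]], sep, fixed = TRUE), as.numeric)
    new_name  <- sub("{.col}", col, name_template, fixed = TRUE)

    df[[new_name]] <- vapply(per_chain, fn, numeric(1))
  }

  df
}

res <- summarize_cols(
  meta,
  cols          = c("all_ins", "all_del"),
  fn            = stats::median,
  name_template = "median_{.col}"
)
```

The original columns are left untouched, and one new numeric column is added per summarized column.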

This function can also be used for character strings such as the chains column. In this example we are creating a new column in the meta.data containing the unique chains for each cell.

res <- so |>
  summarize_vdj(
    data_cols = "chains",
    fn        = ~ paste0(unique(.x), collapse = "_"),
    col_names = "unique_chains"
  )

res |>
  slot("meta.data") |>
  filter(n_chains > 2) |>
  select(chains, unique_chains) |>
  head(2)
#>                             chains unique_chains
#> BL6_AGTGGGACATTTCACT-1 IGH;IGK;IGK       IGH_IGK
#> BL6_AGTGGGAGTCAGATAA-1 IGH;IGK;IGK       IGH_IGK
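
The `~ paste0(unique(.x), collapse = "_")` expression above is a one-sided formula in which `.x` stands for the vector of per-chain values for a cell. As a minimal sketch, assuming the values are simply split on ";", the same result can be reproduced in base R with an ordinary function; the toy input is illustrative.

```r
# Toy per-cell chain strings (illustrative only)
chains <- c("IGH;IGK;IGK", "IGH;IGK", "IGK")

# Split each cell's string into a vector of chains
per_chain <- strsplit(chains, ";", fixed = TRUE)

# Collapse the unique chains for each cell into a single string
unique_chains <- vapply(
  per_chain,
  function(x) paste0(unique(x), collapse = "_"),
  character(1)
)
```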

Mutating per-chain data

Another way to modify V(D)J data present in the object is with the mutate_vdj() function. The function behaves similarly to dplyr::mutate(), but first parses the per-chain values for each cell and converts them to a vector. This allows vector operations to be performed when modifying the meta.data.

In this example we are calculating the sum of all insertions and deletions for each cell and storing this information in a new column called ‘total_indels’.

res <- so |>
  mutate_vdj(
    total_indels = sum(all_ins, all_del)
  )

res |>
  slot("meta.data") |>
  select(all_ins, all_del, total_indels) |>
  head()
#>                        all_ins all_del total_indels
#> BL6_AAACGGGGTTCTGTTT-1    <NA>    <NA>           NA
#> BL6_AAAGATGCAACAACCT-1       0       3            3
#> BL6_AACGTTGCACGACTCG-1       0       6            6
#> BL6_AACTTTCTCGCCTGTT-1       0       5            5
#> BL6_AAGGTTCTCAGTTGAC-1    <NA>    <NA>           NA
#> BL6_AAGTCTGAGTACGACG-1       0       8            8
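
The calculation above can be approximated in base R to show what happens to each cell: the per-chain strings are split into numeric vectors, and the expression is then evaluated once per cell. This is a sketch under that assumption rather than djvdj's actual code; `split_numeric()` and the toy values are hypothetical.

```r
# Toy metadata; NA marks a cell without V(D)J data (illustrative only)
meta <- data.frame(
  all_ins = c(NA, "0", "2;0"),
  all_del = c(NA, "3", "0;3"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn ";"-separated strings into numeric vectors
split_numeric <- function(x, sep = ";") {
  lapply(strsplit(x, sep, fixed = TRUE), as.numeric)
}

ins <- split_numeric(meta$all_ins)
del <- split_numeric(meta$all_del)

# Evaluate sum(all_ins, all_del) separately for each cell
meta$total_indels <- mapply(function(i, d) sum(i, d), ins, del)
```

Cells without V(D)J data propagate NA, which matches the NA rows in the output above.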

Session info

#> R version 4.3.1 (2023-06-16)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 22.04.3 LTS
#> 
#> Matrix products: default
#> BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
#> LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0
#> 
#> locale:
#>  [1] LC_CTYPE=C.UTF-8       LC_NUMERIC=C           LC_TIME=C.UTF-8       
#>  [4] LC_COLLATE=C.UTF-8     LC_MONETARY=C.UTF-8    LC_MESSAGES=C.UTF-8   
#>  [7] LC_PAPER=C.UTF-8       LC_NAME=C              LC_ADDRESS=C          
#> [10] LC_TELEPHONE=C         LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C   
#> 
#> time zone: UTC
#> tzcode source: system (glibc)
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> other attached packages:
#> [1] dplyr_1.1.3        ggplot2_3.4.4      SeuratObject_4.1.4
#> [4] Seurat_4.4.0       djvdj_0.1.0       
#> 
#> loaded via a namespace (and not attached):
#>   [1] RColorBrewer_1.1-3      jsonlite_1.8.7         
#>   [3] magrittr_2.0.3          spatstat.utils_3.0-3   
#>   [5] rmarkdown_2.25          zlibbioc_1.46.0        
#>   [7] fs_1.6.3                ragg_1.2.6             
#>   [9] vctrs_0.6.4             ROCR_1.0-11            
#>  [11] Rsamtools_2.16.0        memoise_2.0.1          
#>  [13] spatstat.explore_3.2-5  RCurl_1.98-1.12        
#>  [15] htmltools_0.5.6.1       sass_0.4.7             
#>  [17] sctransform_0.4.1       parallelly_1.36.0      
#>  [19] KernSmooth_2.23-21      bslib_0.5.1            
#>  [21] htmlwidgets_1.6.2       desc_1.4.2             
#>  [23] ica_1.0-3               plyr_1.8.9             
#>  [25] plotly_4.10.3           zoo_1.8-12             
#>  [27] cachem_1.0.8            igraph_1.5.1           
#>  [29] mime_0.12               lifecycle_1.0.3        
#>  [31] pkgconfig_2.0.3         Matrix_1.6-1.1         
#>  [33] R6_2.5.1                fastmap_1.1.1          
#>  [35] GenomeInfoDbData_1.2.10 fitdistrplus_1.1-11    
#>  [37] future_1.33.0           shiny_1.7.5.1          
#>  [39] digest_0.6.33           colorspace_2.1-0       
#>  [41] S4Vectors_0.38.2        patchwork_1.1.3        
#>  [43] rprojroot_2.0.3         tensor_1.5             
#>  [45] irlba_2.3.5.1           GenomicRanges_1.52.1   
#>  [47] textshaping_0.3.7       progressr_0.14.0       
#>  [49] fansi_1.0.5             spatstat.sparse_3.0-2  
#>  [51] httr_1.4.7              polyclip_1.10-6        
#>  [53] abind_1.4-5             compiler_4.3.1         
#>  [55] bit64_4.0.5             withr_2.5.1            
#>  [57] BiocParallel_1.34.2     MASS_7.3-60            
#>  [59] tools_4.3.1             lmtest_0.9-40          
#>  [61] httpuv_1.6.12           future.apply_1.11.0    
#>  [63] goftest_1.2-3           glue_1.6.2             
#>  [65] nlme_3.1-162            promises_1.2.1         
#>  [67] grid_4.3.1              Rtsne_0.16             
#>  [69] cluster_2.1.4           reshape2_1.4.4         
#>  [71] generics_0.1.3          gtable_0.3.4           
#>  [73] spatstat.data_3.0-1     tzdb_0.4.0             
#>  [75] tidyr_1.3.0             data.table_1.14.8      
#>  [77] hms_1.1.3               XVector_0.40.0         
#>  [79] sp_2.1-1                utf8_1.2.4             
#>  [81] BiocGenerics_0.46.0     spatstat.geom_3.2-7    
#>  [83] RcppAnnoy_0.0.21        ggrepel_0.9.4          
#>  [85] RANN_2.6.1              pillar_1.9.0           
#>  [87] stringr_1.5.0           vroom_1.6.4            
#>  [89] later_1.3.1             splines_4.3.1          
#>  [91] lattice_0.21-8          bit_4.0.5              
#>  [93] survival_3.5-5          deldir_1.0-9           
#>  [95] tidyselect_1.2.0        Biostrings_2.68.1      
#>  [97] miniUI_0.1.1.1          pbapply_1.7-2          
#>  [99] knitr_1.44              gridExtra_2.3          
#> [101] IRanges_2.34.1          scattermore_1.2        
#> [103] stats4_4.3.1            xfun_0.40              
#> [105] matrixStats_1.0.0       stringi_1.7.12         
#> [107] lazyeval_0.2.2          yaml_2.3.7             
#> [109] evaluate_0.22           codetools_0.2-19       
#> [111] tibble_3.2.1            cli_3.6.1              
#> [113] uwot_0.1.16             xtable_1.8-4           
#> [115] reticulate_1.34.0       systemfonts_1.0.5      
#> [117] munsell_0.5.0           jquerylib_0.1.4        
#> [119] GenomeInfoDb_1.36.4     Rcpp_1.0.11            
#> [121] globals_0.16.2          spatstat.random_3.2-1  
#> [123] png_0.1-8               parallel_4.3.1         
#> [125] ellipsis_0.3.2          pkgdown_2.0.7          
#> [127] readr_2.1.4             bitops_1.0-7           
#> [129] listenv_0.9.0           viridisLite_0.4.2      
#> [131] scales_1.2.1            ggridges_0.5.4         
#> [133] crayon_1.5.2            leiden_0.4.3           
#> [135] purrr_1.0.2             rlang_1.1.1            
#> [137] cowplot_1.1.1

October 24 2023
