Quantifies the extent of overlap between two sets of intervals in terms of base-pairs. Groups that are shared between the inputs are used to calculate the statistic for subsets of data.
Value
tibble with the following columns:
len_i: length of the intersection in base-pairs
len_u: length of the union in base-pairs
jaccard: value of the Jaccard statistic
n_int: number of intersecting intervals between x and y (this column appears as n in the example output)
If inputs are grouped, the return value will contain one set of values per group.
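When reading these columns, it can help to check how they relate numerically. The base R sketch below uses only the numbers printed for the ungrouped call in the Examples section (the variable names are illustrative). Dividing len_i by len_u directly does not reproduce the reported jaccard value, whereas len_i / (len_u - len_i) does. This suggests that len_u is reported as |x| + |y| before the overlap is subtracted. That is an inference from the printed output, not a documented guarantee.

```r
# Values copied from the ungrouped example output in the Examples section
len_i <- 236184699
len_u <- 1708774142

# Plain ratio of the two columns: about 0.138, not the printed 0.160
ratio_direct <- len_i / len_u

# Subtracting the overlap from len_u first: about 0.160, as printed
ratio_adjusted <- len_i / (len_u - len_i)

print(round(c(direct = ratio_direct, adjusted = ratio_adjusted), 3))
```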
Details
The Jaccard statistic takes values in [0, 1] and is measured as:
$$ J(x,y) = \frac{\lvert x \cap y \rvert}{\lvert x \cup y \rvert} = \frac{\lvert x \cap y \rvert}{\lvert x \rvert + \lvert y \rvert - \lvert x \cap y \rvert} $$
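As a minimal sketch of the formula, the base R snippet below computes the statistic by brute force for two small, made-up interval sets on a single chromosome. It assumes 0-based, half-open intervals (the BED convention) and does not call valr. The data and the helper `covered_bases()` are hypothetical and exist only for illustration.

```r
# Toy intervals on one chromosome (made-up data; 0-based, half-open as in BED)
x <- data.frame(start = c(0, 20), end = c(10, 30))
y <- data.frame(start = c(5, 25), end = c(15, 40))

# Hypothetical helper: enumerate every base position covered by a set of intervals
covered_bases <- function(iv) {
  unique(unlist(Map(function(s, e) seq(s, e - 1), iv$start, iv$end)))
}

bases_x <- covered_bases(x)
bases_y <- covered_bases(y)

len_x <- length(bases_x)                      # |x|
len_y <- length(bases_y)                      # |y|
len_i <- length(intersect(bases_x, bases_y))  # size of the intersection
len_union <- length(union(bases_x, bases_y))  # size of the union

# Second form of the formula: intersection over |x| + |y| - intersection
jaccard <- len_i / (len_x + len_y - len_i)
```

Here |x| = 20, |y| = 25 and the overlap is 10 bases, so both forms of the formula give 10/35, roughly 0.286. Enumerating individual bases is only practical for toy data like this.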
Interval statistics can be used in combination with dplyr::group_by() and dplyr::do() to calculate statistics for subsets of data. See vignette('interval-stats') for examples.
See also
https://bedtools.readthedocs.io/en/latest/content/tools/jaccard.html
Other interval statistics: bed_absdist(), bed_fisher(), bed_projection(), bed_reldist()
Examples
genome <- read_genome(valr_example("hg19.chrom.sizes.gz"))
x <- bed_random(genome, seed = 1010486)
y <- bed_random(genome, seed = 9203911)
bed_jaccard(x, y)
#> # A tibble: 1 × 4
#> len_i len_u jaccard n
#> <dbl> <dbl> <dbl> <dbl>
#> 1 236184699 1708774142 0.160 399981
# calculate jaccard per chromosome
bed_jaccard(
  dplyr::group_by(x, chrom),
  dplyr::group_by(y, chrom)
)
#> # A tibble: 25 × 5
#> chrom len_i len_u jaccard n
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 chr1 18939046 137345996 0.160 32156
#> 2 chr10 10524360 75209830 0.163 17830
#> 3 chr11 10378246 74655177 0.161 17497
#> 4 chr12 10146255 73725046 0.160 17163
#> 5 chr13 8867024 63737541 0.162 14992
#> 6 chr14 8047103 59033144 0.158 13647
#> 7 chr15 7794057 56514322 0.160 13236
#> 8 chr16 6907575 49874077 0.161 11650
#> 9 chr17 6186446 44917522 0.160 10487
#> 10 chr18 6044900 43248877 0.162 10129
#> # ℹ 15 more rows