Skip to contents

This function uses a pileup routine to examine numerate base counts from alignments at specified sites, regions, or across all read alignments, from one or more BAM files. Alignment and site filtering options are controlled by the FilterParam class. A RangedSummarizedExperiment object is returned, populated with base count statistics for each supplied BAM file.

Usage

pileup_sites(
  bamfiles,
  fasta,
  sites = NULL,
  region = NULL,
  chroms = NULL,
  param = FilterParam(),
  BPPARAM = SerialParam(),
  umi_tag = NULL,
  verbose = FALSE
)

FilterParam(
  max_depth = 10000,
  min_depth = 1L,
  min_base_quality = 20L,
  min_mapq = 0L,
  library_type = "fr-first-strand",
  bam_flags = NULL,
  only_keep_variants = FALSE,
  trim_5p = 0L,
  trim_3p = 0L,
  ftrim_5p = 0,
  ftrim_3p = 0,
  indel_dist = 0L,
  splice_dist = 0L,
  min_splice_overhang = 0L,
  homopolymer_len = 0L,
  max_mismatch_type = c(0L, 0L),
  read_bqual = c(0, 0),
  min_variant_reads = 0L,
  min_allelic_freq = 0,
  report_multiallelic = TRUE,
  remove_overlaps = TRUE
)

Arguments

bamfiles

a character vector, BamFile or BamFileList indicating 1 or more BAM files to process. If named, the names will be included in the colData of the RangedSummarizedExperiment as a sample column, otherwise the names will be taken from the basename of the BAM file.

fasta

path to genome fasta file used for read alignment. Can be provided in compressed gzip or bgzip format.

sites

a GRanges object containing regions or sites to process.

region

samtools region query string (i.e. chr1:100-1000). Can be combined with sites, in which case sites will be filtered to keep only sites within the region.

chroms

chromosomes to process, provided as a character vector. Not to be used with the region parameter.

param

object of class FilterParam() which specify various filters to apply to reads and sites during pileup.

BPPARAM

A BiocParallel class to control parallel execution. Parallel processing occurs per chromosome and is disabled when run on a single region.

umi_tag

The BAM tag containing a UMI sequence. If supplied, multiple reads with the same UMI sequence will only be counted once per position.

verbose

if TRUE, then report progress and warnings.

max_depth

maximum read depth considered at each site

min_depth

min read depth needed to report site

min_base_quality

min base quality score to consider read for pileup

min_mapq

minimum required MAPQ score. Values for each input BAM file can be provided as a vector.

library_type

read orientation, one of fr-first-strand, fr-second-strand, and unstranded. Unstranded library type will be reported with variants w.r.t the + strand. Values for each input BAM file can be provided as a vector.

bam_flags

bam flags to filter or keep, use Rsamtools::scanBamFlag() to generate.

only_keep_variants

if TRUE, then only variant sites will be reported (FALSE by default). Values for each input BAM file can be provided as a vector.

trim_5p

Bases to trim from 5' end of read alignments

trim_3p

Bases to trim from 3' end of read alignments

ftrim_5p

Fraction of bases to trim from 5' end of read alignments

ftrim_3p

Fraction of bases to trim from 3' end of read alignments

indel_dist

Exclude read if site occurs within given distance from indel event in the read

splice_dist

Exclude read if site occurs within given distance from splicing event in the read

min_splice_overhang

Exclude read if site is located adjacent to splice site with an overhang less than given length.

homopolymer_len

Exclude site if occurs within homopolymer of given length

max_mismatch_type

Exclude read if it has X different mismatch types (e.g A-to-G, G-to-C, C-to-G, is 3 mismatch types) or Y # of mismatches, must be supplied as a integer vector of length 2. e.g. c(X, Y).

read_bqual

Exclude read if more than X percent of the bases have base qualities less than Y. Numeric vector of length 2. e.g. c(0.25, 20)

min_variant_reads

Required number of reads containing a variant for a site to be reported. Calculated per bam file, such that if 1 bam file has >= min_variant_reads, then the site will be reported.

min_allelic_freq

minimum allelic frequency required for a variant to be reported in ALT assay.

report_multiallelic

if TRUE, report sites with multiple variants passing filters. If FALSE, site will not be reported.

remove_overlaps

if TRUE, enable read pair overlap detection, which will count only 1 read in regions where read pairs overlap using the htslib algorithm. In brief for each overlapping base pair the base quality of the base with the lower quality is set to 0, which discards it from being counted.

Value

A RangedSummarizedExperiment object populated with multiple assays:

  • ALT: Alternate base(s) found at each position

  • nRef: # of reads supporting the reference base

  • nAlt: # of reads supporting an alternate base

  • nA: # of reads with A

  • nT: # of reads with T

  • nC: # of reads with C

  • nG: # of reads with G

The rowRanges() contains the genomic interval for each site, along with:

  • REF: The reference base

  • rpbz: Mann-Whitney U test of Read Position Bias from bcftools, extreme negative or positive values indicate more bias.

  • vdb: Variant Distance Bias for filtering splice-site artifacts from bcftools, lower values indicate more bias.

  • sor Strand Odds Ratio Score, strand bias estimated by the Symmetric Odds Ratio test, based on GATK code. Higher values indicate more bias.

The rownames will be populated with the format site_[seqnames]_[position(1-based)]_[strand], with strand being encoded as 1 = +, 2 = -, and 3 = *.

See also

Other pileup: pileup_cells()

Examples

library(SummarizedExperiment)
bamfn <- raer_example("SRR5564269_Aligned.sortedByCoord.out.md.bam")
bam2fn <- raer_example("SRR5564277_Aligned.sortedByCoord.out.md.bam")
fafn <- raer_example("human.fasta")

rse <- pileup_sites(bamfn, fafn)

fp <- FilterParam(only_keep_variants = TRUE, min_depth = 55)
pileup_sites(bamfn, fafn, param = fp)
#> class: RangedSummarizedExperiment 
#> dim: 7 1 
#> metadata(0):
#> assays(7): ALT nRef ... nC nG
#> rownames(7): site_DHFR_207_2 site_DHFR_208_2 ... site_DHFR_300_2
#>   site_DHFR_332_2
#> rowData names(4): REF rpbz vdb sor
#> colnames(1): SRR5564269_Aligned.sortedByCoord.out.md.bam
#> colData names(1): sample


# using multiple bam files

bams <- rep(c(bamfn, bam2fn), each = 3)
sample_ids <- paste0(rep(c("KO", "WT"), each = 3), 1:3)
names(bams) <- sample_ids

fp <- FilterParam(only_keep_variants = TRUE)
rse <- pileup_sites(bams, fafn, param = fp)
rse
#> class: RangedSummarizedExperiment 
#> dim: 74 6 
#> metadata(0):
#> assays(7): ALT nRef ... nC nG
#> rownames(74): site_SSR3_102_2 site_SSR3_125_2 ... site_DHFR_430_2
#>   site_DHFR_513_2
#> rowData names(4): REF rpbz vdb sor
#> colnames(6): KO1 KO2 ... WT2 WT3
#> colData names(1): sample

rse$condition <- substr(rse$sample, 1, 2)
assays(rse)
#> List of length 7
#> names(7): ALT nRef nAlt nA nT nC nG

colData(rse)
#> DataFrame with 6 rows and 2 columns
#>          sample   condition
#>     <character> <character>
#> KO1         KO1          KO
#> KO2         KO2          KO
#> KO3         KO3          KO
#> WT1         WT1          WT
#> WT2         WT2          WT
#> WT3         WT3          WT

rowRanges(rse)
#> GRanges object with 74 ranges and 4 metadata columns:
#>                   seqnames    ranges strand |         REF       rpbz
#>                      <Rle> <IRanges>  <Rle> | <character>  <numeric>
#>   site_SSR3_102_2     SSR3       102      - |           T   2.611800
#>   site_SSR3_125_2     SSR3       125      - |           C   0.623238
#>   site_SSR3_156_2     SSR3       156      - |           C   1.875144
#>   site_SSR3_176_2     SSR3       176      - |           A  -0.676423
#>   site_SSR3_198_2     SSR3       198      - |           A   1.817678
#>               ...      ...       ...    ... .         ...        ...
#>   site_DHFR_397_2     DHFR       397      - |           A -2.7341600
#>   site_DHFR_399_2     DHFR       399      - |           G -0.2094944
#>   site_DHFR_423_2     DHFR       423      - |           T -0.0815516
#>   site_DHFR_430_2     DHFR       430      - |           A -2.6781858
#>   site_DHFR_513_2     DHFR       513      - |           T -1.2475213
#>                           vdb       sor
#>                     <numeric> <numeric>
#>   site_SSR3_102_2 2.21621e-02  2.776147
#>   site_SSR3_125_2 2.21621e-02  0.478641
#>   site_SSR3_156_2 2.21621e-02  2.786335
#>   site_SSR3_176_2 2.00523e-05  2.154124
#>   site_SSR3_198_2 2.21621e-02  2.797103
#>               ...         ...       ...
#>   site_DHFR_397_2   0.0221621  2.775538
#>   site_DHFR_399_2   0.0221621  0.306541
#>   site_DHFR_423_2   0.0221621  2.773352
#>   site_DHFR_430_2   0.0221621  2.773426
#>   site_DHFR_513_2   0.0221621  2.772591
#>   -------
#>   seqinfo: 3 sequences from an unspecified genome

# specifying regions to query using GRanges object

sites <- rowRanges(rse)
rse <- pileup_sites(bams, fafn, sites = sites)
rse
#> class: RangedSummarizedExperiment 
#> dim: 74 6 
#> metadata(0):
#> assays(7): ALT nRef ... nC nG
#> rownames(74): site_SSR3_102_2 site_SSR3_125_2 ... site_DHFR_430_2
#>   site_DHFR_513_2
#> rowData names(4): REF rpbz vdb sor
#> colnames(6): KO1 KO2 ... WT2 WT3
#> colData names(1): sample

rse <- pileup_sites(bams, fafn, chroms = c("SPCS3", "DHFR"))
rse
#> class: RangedSummarizedExperiment 
#> dim: 1166 6 
#> metadata(0):
#> assays(7): ALT nRef ... nC nG
#> rownames(1166): site_SPCS3_1_1 site_SPCS3_2_1 ... site_DHFR_517_2
#>   site_DHFR_518_2
#> rowData names(4): REF rpbz vdb sor
#> colnames(6): KO1 KO2 ... WT2 WT3
#> colData names(1): sample

rse <- pileup_sites(bams, fafn, region = "DHFR:100-101")
rse
#> class: RangedSummarizedExperiment 
#> dim: 2 6 
#> metadata(0):
#> assays(7): ALT nRef ... nC nG
#> rownames(2): site_DHFR_100_2 site_DHFR_101_2
#> rowData names(4): REF rpbz vdb sor
#> colnames(6): KO1 KO2 ... WT2 WT3
#> colData names(1): sample