Skip to contents

OligodT will prime at A-rich regions in an RNA. Reverse transcription from these internal priming sites will install an oligodT sequence at the 3' end of the cDNA. Sequence variants within these internal priming sites are enriched for variants converting the genomic sequence to the A encoded by the oligodT primer. Trimming poly(A) from the 3' ends of reads reduces but does not eliminate these signals

This function will identify regions that are enriched for mispriming events. Reads that were trimmed to remove poly(A) (encoded in the pa tag by 10x Genomics) are identified. The aligned 3' positions of these reads are counted, and sites passing thresholds (at least 2 reads) are retained as possible sites of mispriming. Be default regions 5 bases upstream and 20 bases downstream of these putative mispriming sites are returned.

Usage

find_mispriming_sites(
  bamfile,
  fasta,
  pos_5p = 5,
  pos_3p = 20,
  min_reads = 2,
  tag = "pa",
  tag_values = 3:300,
  n_reads_per_chunk = 1e+06,
  verbose = TRUE
)

Arguments

bamfile

path to bamfile

fasta

path to fasta file

pos_5p

distance 5' of mispriming site to define mispriming region

pos_3p

distance 3' of mispriming site to define mispriming region

min_reads

minimum required number of reads at a mispriming site

tag

bam tag containing number of poly(A) bases trimmed

tag_values

range of values required for read to be considered

n_reads_per_chunk

number of reads to process in memory, see Rsamtools::BamFile()

verbose

if true report progress

Value

A GenomicsRanges containing regions enriched for putative mispriming events. The n_reads column specifies the number of polyA trimmed reads overlapping the mispriming region. mean_pal indicates the mean length of polyA sequence trimmed from reads overlapping the region. The n_regions column specifies the number overlapping independent regions found in each chunk (dictated by n_reads_per_chunk). The A_freq column indicates the frequency of A bases within the region.

Examples

bam_fn <- raer_example("5k_neuron_mouse_possort.bam")
fa_fn <- raer_example("mouse_tiny.fasta")
find_mispriming_sites(bam_fn, fa_fn)
#> working on 2:580-633:- to 2:580-633:-
#> GRanges object with 1 range and 4 metadata columns:
#>       seqnames    ranges strand |  mean_pal   n_reads n_regions    A_freq
#>          <Rle> <IRanges>  <Rle> | <numeric> <integer> <integer> <numeric>
#>   [1]        2   560-585      - |   34.6667         6         1  0.615385
#>   -------
#>   seqinfo: 4 sequences from an unspecified genome