Workflow Overview¶

This page provides a high-level view of the aa-tRNA-seq pipeline architecture and data flow.

Pipeline Architecture¶

The pipeline is organized into modular Snakemake rule files:

flowchart TB
    subgraph rules[Rule Files]
        A[aatrnaseq-process.smk<br/>Core processing]
        B[aatrnaseq-charging.smk<br/>Charging analysis]
        C[aatrnaseq-qc.smk<br/>Quality control]
        D[aatrnaseq-modifications.smk<br/>Modification calling]
        OR[aatrnaseq-odds-ratios.smk<br/>Odds ratio analysis]
        R[aatrnaseq-report.smk<br/>QC report]
        E[warpdemux.smk<br/>Demultiplexing<br/><i>conditional</i>]
    end

    subgraph common[common.smk]
        F[Sample parsing]
        G[Helper functions]
        H[Output definitions]
    end

    I[Snakefile<br/>Main entry] --> rules
    I --> common

Complete Pipeline Flow¶

Standard Pipeline (No Demultiplexing)¶

flowchart TB
    subgraph Input
        A[POD5 files<br/>per sample]
    end

    subgraph Processing[aatrnaseq-process.smk]
        B[merge_pods<br/>Merge POD5s]
        C[rebasecall<br/>Dorado basecalling]
        D[ubam_to_fastq<br/>Extract FASTQ]
        E[bwa_align<br/>Align to reference]
        F[classify_charging<br/>Remora ML]
        G[transfer_bam_tags<br/>Rename tags]
        G2[add_adapter_tags<br/>PT tags]
        G3[finalize_bam<br/>Symlink final BAM]
    end

    subgraph Charging[aatrnaseq-charging.smk]
        H[get_cca_trna<br/>Extract probabilities]
        I[get_cca_trna_cpm<br/>Calculate CPM]
    end

    subgraph QC[aatrnaseq-qc.smk]
        J[base_calling_error<br/>Error metrics]
        K[align_stats<br/>Read statistics]
        L[remora_signal_stats<br/>Signal metrics]
    end

    subgraph Mods[aatrnaseq-modifications.smk]
        M[bam_to_coverage<br/>Coverage tracks]
        N[modkit_pileup<br/>Modification consensus]
        O[modkit_extract_calls<br/>Per-read mods]
        P[modkit_extract_full<br/>Full export]
    end

    subgraph OddsRatios[aatrnaseq-odds-ratios.smk]
        Q[compute_odds_ratios<br/>Pairwise OR]
    end

    subgraph Report[aatrnaseq-report.smk]
        R[render_combined_qc_report<br/>QC report]
    end

    A --> B --> C --> D --> E --> F --> G --> G2 --> G3

    G3 --> H --> I
    G3 --> J
    G3 --> K
    G3 --> L
    G3 --> M
    G3 --> N
    G3 --> O
    G3 --> P
    O --> Q
    H --> Q
    J --> R
    H --> R
    I --> R
    K --> R

With Demultiplexing (WarpDemuX + EDX)¶

flowchart TB
    subgraph Input
        A[Pooled POD5 files<br/>per run]
    end

    subgraph WDX[WDX Demultiplexing]
        B[warpdemux<br/>Barcode prediction]
        C[parse_warpdemux<br/>Create mapping]
        D[extract_sample_reads<br/>Per-sample IDs]
        E[split_pod5<br/>Split by WDX barcode]
    end

    subgraph Standard[Standard Pipeline]
        F[rebasecall]
    end

    subgraph EDX[EDX Early Splitting]
        G[detect_edx_adapters<br/>3' adapter ID per read]
        H[extract_edx_read_ids]
        I[filter_fastq_by_edx]
        J[filter_pod5_by_edx]
    end

    subgraph Downstream[Downstream Processing]
        K[bwa_align → classify_charging → ...]
    end

    A --> B --> C --> D --> E --> F --> G --> H
    H --> I --> K
    H --> J --> K

Rule Categories¶

Processing Rules¶

Core data processing from raw signal to classified reads:

Rule	Purpose	GPU
`merge_pods`	Combine POD5 files per sample	No
`rebasecall`	Basecall with Dorado	Yes
`ubam_to_fastq`	Extract reads for alignment	No
`bwa_idx`	Build BWA index	No
`bwa_align`	Align reads to reference	No
`classify_charging`	ML charging classification	No
`transfer_bam_tags`	Rename ML→CL tags	No
`add_adapter_tags`	Add PT tags for adapter positions	No
`finalize_bam`	Symlink final BAM	No

Charging Analysis Rules¶

Extract and summarize charging classification:

Rule	Purpose
`get_cca_trna`	Extract per-read charging scores
`get_cca_trna_cpm`	Calculate CPM-normalized counts

Quality Control Rules¶

Generate QC metrics and statistics:

Rule	Purpose
`compute_reference_similarity`	Pairwise reference sequence similarity matrix
`base_calling_error`	Per-position error frequencies
`align_stats`	Read counts through pipeline
`remora_signal_stats`	Raw signal metrics

Modification Rules¶

RNA modification calling with Modkit:

Rule	Purpose
`bam_to_coverage`	Generate coverage tracks
`modkit_pileup`	Per-site modification consensus
`modkit_extract_calls`	Per-read modification calls
`modkit_extract_full`	Comprehensive modification export

Odds Ratio Rules¶

Per-tRNA pairwise modification odds ratios:

Rule	Purpose
`compute_odds_ratios`	Pairwise modification odds ratios per tRNA

Report Rules¶

QC report generation:

Rule	Purpose
`render_combined_qc_report`	Combined Quarto QC report with per-sample tabs

Demultiplexing Rules¶

Optional WarpDemuX barcode demultiplexing:

Rule	Purpose
`warpdemux`	Run WDX barcode prediction
`parse_warpdemux`	Parse predictions to mapping
`extract_sample_reads`	Filter reads by WDX barcode
`split_pod5`	Create per-sample WDX POD5s
`detect_edx_adapters`	Detect 3' adapter identity per read
`extract_edx_read_ids`	Extract matching read IDs for EDX
`filter_fastq_by_edx`	Create EDX-filtered FASTQ
`filter_pod5_by_edx`	Create EDX-filtered POD5
`edx_concordance`	WDX vs EDX concordance table

Key Processing Steps¶

1. POD5 Merging¶

Individual POD5 files from a sequencing run are merged into a single file per sample:

Text Only
run1/pod5_pass/*.pod5  ─┐
run1/pod5_fail/*.pod5  ─┼──► sample.pod5
run2/pod5/*.pod5       ─┘

2. Basecalling¶

Dorado re-basecalls with:

Move tables (--emit-moves) required for Remora
Modification calling (--modified-bases pseU m5C inosine_m6A)
High-accuracy model (rna004_130bps_sup@v5.3.0)

3. Alignment¶

BWA MEM with RNA-optimized parameters:

-x ont2d preset for ONT reads
-F 20: Unmapped and reverse-strand read removal

4. Charging Classification¶

Remora ML model analyzes signal at CCA 3' end:

Input: POD5 (signal) + BAM (alignment)
Output: ML tag (0-255 score)
Threshold: ≥200 = charged

Remora Classification Details

A few notes about Remora classification for charged vs. uncharged tRNA reads:

This step retains only full-length tRNA reads (with an allowance for signal loss at the 5' end of nanopore direct RNA sequencing)
The current approach does not rely on differences in adapter sequences attached to charged vs. uncharged tRNA molecules (though these sequences are retained as separate entries in the alignment reference). The pipeline relies exclusively on signal data over a 6-nucleotide modification kmer spanning the universal CCA 3' end of tRNA and the first three nucleotides of the 3' adapter (CCAGGC) to distinguish charged and uncharged reads.

5. Tag Transfer¶

Original Remora tags are renamed to avoid conflicts:

ML → CL (charging likelihood)
MM → CM (charging metadata)

6. Adapter Position Tagging¶

The add_adapter_tags rule adds PT tags with adapter boundaries:

Uses parasail Smith-Waterman alignment
Detects 5' and 3' adapter positions
Can infer 5' adapter from alignment position when truncated

Resource Requirements¶

GPU Rules¶

These rules require GPU access:

Rule	Typical Runtime	Memory
`rebasecall`	30-60 min/sample	24 GB

CPU-Intensive Rules¶

Rule	Threads	Memory
`merge_pods`	12	16 GB
`bwa_align`	12	24 GB
`classify_charging`	8	24 GB
`modkit_extract_full`	12	48 GB

Memory-Intensive Rules¶

Rule	Memory
`modkit_extract_calls`	96 GB
`warpdemux`	32 GB
`remora_signal_stats`	24 GB

Configuration Points¶

Key parameters that affect pipeline behavior:

Parameter	Affects
`opts.dorado`	Basecalling modifications
`opts.bwa`	Alignment sensitivity
`opts.bam_filter`	Full-length read filtering
`modkit.mod_thresholds`	Modification calling stringency
`warpdemux.barcode_kit`	Demultiplexing model

Next Steps¶

Rules Reference - Detailed rule documentation
Scripts Reference - Python scripts documentation
Demultiplexing - WarpDemuX setup guide