
Architecture

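This page describes the technical architecture of the aa-tRNA-seq pipeline: repository layout, Snakemake organization, data flow, configuration, external tools, and reproducibility.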

Directory Structure

Text Only
aa-tRNA-seq-pipeline/
├── workflow/
│   ├── Snakefile              # Main entry point
│   ├── rules/                 # Modular rule files
│   │   ├── common.smk         # Utilities and helpers
│   │   ├── aatrnaseq-process.smk    # Core processing
│   │   ├── aatrnaseq-charging.smk   # Charging analysis
│   │   ├── aatrnaseq-qc.smk         # Quality control
│   │   ├── aatrnaseq-modifications.smk  # Modkit rules
│   │   └── warpdemux.smk      # Demultiplexing (conditional)
│   └── scripts/               # Python processing scripts
├── config/
│   ├── config-base.yml        # Default configuration
│   ├── config-test.yml        # Test configuration
│   └── samples-*.tsv/yml      # Sample definitions
├── cluster/
│   ├── lsf/                   # LSF cluster profile
│   └── generic/               # Generic cluster profile
├── resources/
│   ├── ref/                   # Reference sequences
│   ├── models/                # ML models
│   ├── kmers/                 # Kmer tables
│   └── tools/                 # External tools (dorado, modkit)
├── .tests/                    # Test data and scripts
├── pixi.toml                  # Pixi environment definition
└── pixi.lock                  # Locked dependencies

Snakemake Organization

Entry Point

workflow/Snakefile is the main entry point:

Python
import os

# Load base configuration
configfile: "config/config-base.yml"

# Dynamic PATH setup
onstart:
    # Add dorado and modkit to PATH (only the dorado lines are shown)
    dorado_path = f"resources/tools/dorado/{config['dorado_version']}/bin"
    os.environ["PATH"] = f"{dorado_path}:{os.environ['PATH']}"

# Include rule modules
include: "rules/common.smk"
include: "rules/aatrnaseq-process.smk"
include: "rules/aatrnaseq-charging.smk"
include: "rules/aatrnaseq-qc.smk"
include: "rules/aatrnaseq-modifications.smk"

# Conditionally include demux rules
if is_demux_enabled():
    include: "rules/warpdemux.smk"

# Target rule
rule all:
    input:
        pipeline_outputs()

Rule Modules

flowchart TB
    A[Snakefile] --> B[common.smk<br/>Utilities]
    A --> C[aatrnaseq-process.smk<br/>7 rules]
    A --> D[aatrnaseq-charging.smk<br/>2 rules]
    A --> E[aatrnaseq-qc.smk<br/>3 rules]
    A --> F[aatrnaseq-modifications.smk<br/>4 rules]
    A -.->|conditional| G[warpdemux.smk<br/>5 rules]

Rule Categories

| Module | Purpose | Rules |
|---|---|---|
| common.smk | Utilities, sample parsing, output definition | - |
| aatrnaseq-process.smk | Core pipeline | 7 |
| aatrnaseq-charging.smk | Charging analysis | 2 |
| aatrnaseq-qc.smk | Quality control | 3 |
| aatrnaseq-modifications.smk | Modification calling | 4 |
| warpdemux.smk | Demultiplexing (conditional) | 5 |

Key Functions in common.smk

Sample Management

Python
def parse_samples(fl):
    """Auto-detect and parse sample file (TSV or YAML)."""

def parse_samples_tsv(fl):
    """Parse TSV format: sample_id<TAB>path"""

def parse_samples_yaml(fl):
    """Parse YAML format with barcode assignments."""

def find_raw_inputs():
    """Recursively find POD5 files in run directories."""

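The stubs above only give docstrings, so here is a minimal sketch of how the TSV path and the format dispatch might look. It assumes a two-column file, that `#` lines are comments, that duplicate IDs are an error, and that each sample is stored as a dict with a `path` key; none of these details are confirmed by the pipeline source. The YAML schema is not shown on this page, so that branch is deliberately left unimplemented.

```python
import csv
from pathlib import Path


def parse_samples_tsv(fl):
    """Parse a two-column TSV (sample_id<TAB>path) into a dict keyed by sample ID."""
    samples = {}
    with open(fl, newline="") as handle:
        for row in csv.reader(handle, delimiter="\t"):
            # Skip blank lines and comment lines (comment handling is an assumption).
            if not row or row[0].startswith("#"):
                continue
            if len(row) < 2:
                raise ValueError(f"expected 'sample_id<TAB>path', got: {row!r}")
            sample_id, path = row[0].strip(), row[1].strip()
            if sample_id in samples:
                raise ValueError(f"duplicate sample_id: {sample_id}")
            # The {"path": ...} record layout is hypothetical.
            samples[sample_id] = {"path": path}
    return samples


def parse_samples(fl):
    """Dispatch on file extension: .yml/.yaml goes to the YAML parser, anything else to TSV."""
    if Path(fl).suffix.lower() in {".yml", ".yaml"}:
        # The YAML layout (barcode assignments) is not documented here, so it is not sketched.
        raise NotImplementedError("YAML sample sheets are not covered by this sketch")
    return parse_samples_tsv(fl)
```

Dispatching on the extension keeps the detection trivial; the real `parse_samples` may inspect file contents instead.
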
Configuration Helpers

Python
def get_modkit_threshold_opts():
    """Build modkit threshold options from config."""

def get_pipeline_commit():
    """Get git commit ID for reproducibility."""

def is_demux_enabled():
    """Check if WarpDemuX is enabled in config."""

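As a rough illustration of the configuration helpers, the sketch below reads the `warpdemux` and `modkit` sections shown under Key Config Sections. Unlike the stubs, it takes `config` as an explicit argument so it can run outside Snakemake. The flag names `--filter-threshold` and `--mod-thresholds`, and the `{mod_code: threshold}` shape of `mod_thresholds`, are assumptions; check `modkit pileup --help` for the installed version before relying on them.

```python
def is_demux_enabled(config):
    """Return True only when the warpdemux section exists and enabled is truthy."""
    # `or {}` guards against an empty `warpdemux:` key, which YAML loads as None.
    section = config.get("warpdemux") or {}
    return bool(section.get("enabled", False))


def get_modkit_threshold_opts(config):
    """Build a command-line option string from the `modkit` config section."""
    modkit_cfg = config.get("modkit") or {}
    opts = []
    if "filter_threshold" in modkit_cfg:
        # Flag name assumed, not taken from the pipeline source.
        opts.append(f"--filter-threshold {modkit_cfg['filter_threshold']}")
    # Assumed shape: {mod_code: threshold}, for example {"m": 0.8}.
    for mod_code, threshold in (modkit_cfg.get("mod_thresholds") or {}).items():
        opts.append(f"--mod-thresholds {mod_code}:{threshold}")
    return " ".join(opts)
```

Returning an empty string when nothing is configured lets the result be interpolated into a shell command without extra branching.
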
Output Definition

Python
def pipeline_outputs():
    """Define all final outputs for rule all."""
    outputs = []
    for sample in samples:
        outputs.extend([
            f"{outdir}/summary/tables/{sample}/{sample}.charging.cpm.tsv.gz",
            f"{outdir}/summary/tables/{sample}/{sample}.charging_prob.tsv.gz",
            f"{outdir}/summary/tables/{sample}/{sample}.bcerror.tsv.gz",
            # ... more outputs
        ])
    return outputs
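
For testing outside Snakemake, a variant that takes `samples` and `outdir` as arguments instead of reading module globals can be convenient. This is a sketch, not the pipeline's function: it covers only the three outputs listed above, and the `suffixes` list is a hypothetical refactoring.

```python
def pipeline_outputs(samples, outdir):
    """Return the per-sample summary tables that `rule all` would request."""
    # Only the three outputs named in the documentation; the real list is longer.
    suffixes = ["charging.cpm", "charging_prob", "bcerror"]
    return [
        f"{outdir}/summary/tables/{sample}/{sample}.{suffix}.tsv.gz"
        for sample in samples
        for suffix in suffixes
    ]


if __name__ == "__main__":
    for path in pipeline_outputs({"sampleA": {}}, "results"):
        print(path)
```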

Data Flow

Input Discovery

flowchart LR
    A[samples.tsv/yml] --> B[parse_samples]
    B --> C[find_raw_inputs]
    C --> D[samples dict<br/>path, barcode, run_id]
    D --> E[get_raw_inputs<br/>per sample]
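
The discovery step in the flowchart above could be sketched as follows. The helper name `find_pod5_files`, the `raw_files` key, and the choice to fail on an empty run directory are all assumptions made for illustration; the pipeline's `find_raw_inputs` takes no arguments and may behave differently.

```python
from pathlib import Path


def find_pod5_files(run_dir):
    """Return a sorted list of POD5 files found anywhere under run_dir."""
    return sorted(str(p) for p in Path(run_dir).rglob("*.pod5") if p.is_file())


def find_raw_inputs(samples):
    """Attach the discovered POD5 paths to each sample record, in place."""
    for sample_id, info in samples.items():
        # "raw_files" is a hypothetical key name.
        info["raw_files"] = find_pod5_files(info["path"])
        if not info["raw_files"]:
            raise FileNotFoundError(
                f"no POD5 files found for sample {sample_id} under {info['path']}"
            )
    return samples
```

Sorting the result keeps the input order stable between runs, which avoids spurious re-execution when a rule's input list would otherwise change order.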

Processing Pipeline

flowchart TD
    subgraph Per Sample
        A[POD5 files] --> B[merge_pods]
        B --> C[rebasecall]
        C --> D[ubam_to_fastq]
        D --> E[bwa_align]
        E --> F[classify_charging]
        F --> G[transfer_bam_tags]
    end

    subgraph Summaries
        G --> H[get_cca_trna]
        G --> I[base_calling_error]
        G --> J[modkit_pileup]
    end

Global Variables

Defined in common.smk:

Python
# Script directory path
SCRIPT_DIR = os.path.join(SNAKEFILE_DIR, "scripts")

# Output directory from config
outdir = config.get("output_directory", "results")

# Parsed samples dictionary
samples = parse_samples(config["samples"])
find_raw_inputs()  # Populates samples with POD5 paths

Configuration System

Hierarchy

flowchart TB
    A[config-base.yml<br/>Defaults] --> B[Your config.yml<br/>Overrides]
    B --> C[Command line<br/>--config key=value]
    C --> D[Final config dict]
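
The precedence in this hierarchy can be illustrated with a small recursive merge in which later layers win. Snakemake performs the actual merging; the function below is a hypothetical stand-in that assumes nested sections merge key by key, and the example values are made up for the demonstration.

```python
import copy


def merge_config(base, override):
    """Return a new dict in which `override` wins and nested dicts merge key by key."""
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Illustrative layers, lowest precedence first.
base = {"output_directory": "results", "warpdemux": {"enabled": False},
        "modkit": {"filter_threshold": 0.5}}
user = {"output_directory": "my_results", "warpdemux": {"enabled": True}}
cli = {"modkit": {"filter_threshold": 0.7}}

final = merge_config(merge_config(base, user), cli)

if __name__ == "__main__":
    print(final)
```

Because each layer is merged rather than substituted wholesale, a user config only needs to list the keys it changes.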

Key Config Sections

YAML
# Tool paths and versions
base_calling_model: "resources/models/..."
dorado_version: 1.4.0

# Reference files
fasta: "resources/ref/..."
remora_cca_classifier: "resources/models/..."

# Command options
opts:
    dorado: "..."
    bwa: "..."
    bam_filter: "..."

# Modkit thresholds
modkit:
    filter_threshold: 0.5
    mod_thresholds: {...}

# Optional demux
warpdemux:
    enabled: false

External Tools

Tool Management

Tools are installed under resources/tools/, with one subdirectory per version:

Text Only
resources/tools/
├── dorado/
│   └── <version>/
│       └── bin/dorado

PATH Setup

Tools are added to PATH at workflow start, in the Snakefile's onstart handler:

Python
onstart:
    dorado_path = f"resources/tools/dorado/{config['dorado_version']}/bin"
    os.environ["PATH"] = f"{dorado_path}:{os.environ['PATH']}"
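
The snippet above uses a relative path and a hard-coded `:` separator. A slightly more defensive variant is sketched below; the helper name `prepend_to_path` and its `env` parameter are hypothetical and not part of the pipeline. It resolves the directory to an absolute path, uses `os.pathsep`, and avoids duplicate entries when called more than once.

```python
import os


def prepend_to_path(directory, env=None):
    """Put `directory` at the front of PATH in `env` (defaults to os.environ)."""
    env = os.environ if env is None else env
    directory = os.path.abspath(directory)
    # Drop empty entries and any earlier copy of the same directory.
    parts = [p for p in env.get("PATH", "").split(os.pathsep) if p and p != directory]
    parts.insert(0, directory)
    env["PATH"] = os.pathsep.join(parts)
    return env["PATH"]


if __name__ == "__main__":
    demo_env = {"PATH": os.pathsep.join(["/usr/bin", "/bin"])}
    print(prepend_to_path("resources/tools/dorado/1.4.0/bin", demo_env))
```

An absolute path matters if any rule changes its working directory, since a relative PATH entry would then stop resolving.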

Python Scripts

Located in workflow/scripts/:

| Script | Called By | Purpose |
|---|---|---|
| transfer_tags.py | transfer_bam_tags | Transfer BAM tags |
| get_charging_table.py | get_cca_trna | Extract CL tag |
| get_trna_charging_cpm.py | get_cca_trna_cpm | Calculate CPM |
| get_bcerror_freqs.py | base_calling_error | Error metrics |
| get_align_stats.py | align_stats | Read statistics |
| extract_signal_metrics.py | remora_signal_stats | Signal metrics |

Scripts are called via SCRIPT_DIR:

Python
params:
    src=SCRIPT_DIR,
shell:
    "python {params.src}/script.py ..."

Reproducibility

Git Commit Tracking

Python
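from git import Repo  # GitPython

def get_pipeline_commit():
    """Return the abbreviated (8-character) git commit hash, or 'unknown'."""
    try:
        repo = Repo(SNAKEFILE_DIR, search_parent_directories=True)
        return repo.head.commit.hexsha[:8]
    except Exception:
        # Not a git checkout, or the repository could not be read
        return "unknown"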
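
If GitPython is not available, the same idea can be sketched with the `git` command line and the standard library. This is an alternative illustration, not the pipeline's implementation; the `repo_dir` parameter is hypothetical.

```python
import subprocess


def get_pipeline_commit(repo_dir="."):
    """Return an abbreviated commit hash for repo_dir, or 'unknown' on any failure."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "--short=8", "HEAD"],
            cwd=repo_dir,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # git is missing, the directory does not exist, or it is not a repository.
        return "unknown"
    return result.stdout.strip() or "unknown"


if __name__ == "__main__":
    print(get_pipeline_commit())
```

Note that `--short=8` requests at least eight characters; git may return more if eight are not enough to identify the commit unambiguously.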

Locked Dependencies

pixi.lock pins exact dependency versions so that the environment is reproducible:

Bash
# Install exact versions
pixi install

# Update lock file
pixi update
