Prokaryotic Discovery Pipeline

Systematic identification of novel ribosomal stalling peptides

Published

January 18, 2026

Overview

Prokaryotes use ribosomal stalling peptides to regulate gene expression. These peptides are distinct from viral 2A peptides but share the conserved GP motif. This pipeline aims to discover novel stalling peptide families.

Discovery Strategy

We employ two complementary approaches:

flowchart TB
    subgraph "Approach 1: Seed-Based"
        S1[Known Stalling Peptides]
        S2[Build HMMs by Motif]
        S3[Search Prokaryotic DBs]
        S4[Cluster & Refine]
        S1 --> S2 --> S3 --> S4
    end

    subgraph "Approach 2: GP-Based"
        G1[Extract All GP Motifs]
        G2[Annotate Domains]
        G3[Filter Inter-Domain]
        G4[Cluster by Context]
        G1 --> G2 --> G3 --> G4
    end

    S4 --> Merge
    G4 --> Merge
    Merge[Merge Results] --> Final[Novel Peptide Families]

Figure 1: Dual approach for prokaryotic stalling peptide discovery

Approach 1: Seed-Based Discovery

Known Stalling Peptides

Peptide	Organism	Motif	Function
SecM	E. coli	RAGP	Secretion regulation
TnaC	E. coli	GP-like	Tryptophan response
MifM	B. subtilis	GIAGP	Membrane insertion
CydA	E. coli	RAGP	Cytochrome regulation

Motif Families

Stalling peptides are grouped by their upstream motif:

RAGP family: R-A-G-P pattern
RAPG family: R-A-P-G variant
QAPP family: Q-A-P-P pattern
Other GP: Divergent contexts
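
As an illustration of the grouping above, the following minimal sketch assigns a peptide to a motif family by plain substring matching. The function name, the match order, and the "No GP" fallback label are assumptions made for this example and are not taken from the pipeline code.

```python
import re

# Family patterns come from the list above. The match order and the
# fallback labels are assumptions made for this sketch.
MOTIF_FAMILIES = [
    ("RAGP", re.compile(r"RAGP")),
    ("RAPG", re.compile(r"RAPG")),
    ("QAPP", re.compile(r"QAPP")),
]


def classify_motif_family(peptide: str) -> str:
    """Return the first matching motif family, or a fallback label."""
    seq = peptide.upper()
    for family, pattern in MOTIF_FAMILIES:
        if pattern.search(seq):
            return family
    # Any remaining GP occurrence is treated as a divergent context.
    return "Other GP" if "GP" in seq else "No GP"


if __name__ == "__main__":
    # Toy strings, not real peptide sequences.
    for toy in ["MKTRAGPLL", "AARAPGTT", "MMQAPPK", "AAGIAGPSS"]:
        print(toy, classify_motif_family(toy))
```

A real implementation would likely allow degenerate positions in each pattern; exact matching is used here only to keep the example short.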

HMM Building

rule build_comprehensive_hmm:
    """Build HMM from all known stalling peptides."""
    input:
        fasta="resources/stalling-peptides/known_stalling_peptides.fasta"
    output:
        hmm=RESULTS_DIR + "/prokaryotic/models/seed/comprehensive.hmm"

Approach 2: Unbiased GP Discovery

Step 1: Extract GP Motifs

Extract every GP-containing sequence together with its flanking context (30 residues upstream and 15 residues downstream of the GP):

rule extract_gp_motifs:
    """Extract all GP-containing sequences with context."""
    params:
        upstream=30,    # 30 residues before GP
        downstream=15   # 15 residues after GP
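
The windowing that these parameters describe could look roughly like the sketch below. It is a standalone illustration, not the rule's actual script: the names `GPWindow` and `extract_gp_windows` are hypothetical, and only the 30/15 window sizes come from the rule above.

```python
from typing import Iterator, NamedTuple


class GPWindow(NamedTuple):
    gp_start: int    # 0-based index of the G in the GP dipeptide
    upstream: str    # up to `upstream` residues before GP
    downstream: str  # up to `downstream` residues after GP


def extract_gp_windows(
    seq: str, upstream: int = 30, downstream: int = 15
) -> Iterator[GPWindow]:
    """Yield one window per GP occurrence, truncated at sequence ends."""
    seq = seq.upper()
    pos = seq.find("GP")
    while pos != -1:
        yield GPWindow(
            gp_start=pos,
            upstream=seq[max(0, pos - upstream):pos],
            downstream=seq[pos + 2:pos + 2 + downstream],
        )
        # Continue searching one residue further along the sequence.
        pos = seq.find("GP", pos + 1)


if __name__ == "__main__":
    toy = "A" * 40 + "GP" + "C" * 20 + "GP" + "D" * 3  # toy sequence
    for window in extract_gp_windows(toy):
        print(window)
```

Windows near either end of a protein are shorter than the requested size, so downstream steps should not assume a fixed length.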

Step 2: Domain Annotation

Annotate extracted sequences with Pfam domains to identify inter-domain GP motifs (more likely to be functional stalling sites).

rule run_hmmscan:
    """Annotate domains in GP-containing sequences."""
    input:
        sequences="gp_sequences.fasta",
        pfam_db=DATA_DIR + "/pfam/Pfam-A.hmm"
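
If hmmscan is run with its `--domtblout` option (an assumption; the rule's command line is not shown here), the per-domain table can be reduced to domain coordinates per sequence as in the sketch below. The function name and the 1e-3 independent E-value cutoff are illustrative choices, not pipeline settings.

```python
from typing import Dict, Iterable, List, Tuple

# (domain name, envelope start, envelope end), 1-based inclusive
Domain = Tuple[str, int, int]


def parse_domtblout(
    lines: Iterable[str], max_ievalue: float = 1e-3
) -> Dict[str, List[Domain]]:
    """Collect domain envelopes per query sequence from --domtblout lines."""
    domains: Dict[str, List[Domain]] = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header/footer comments
        # 22 whitespace-separated columns, then a free-text description.
        fields = line.split(maxsplit=22)
        if len(fields) < 22:
            continue  # malformed line
        domain_name, query_name = fields[0], fields[3]
        i_evalue = float(fields[12])  # independent E-value of this domain
        env_from, env_to = int(fields[19]), int(fields[20])
        if i_evalue <= max_ievalue:
            domains.setdefault(query_name, []).append(
                (domain_name, env_from, env_to)
            )
    for hits in domains.values():
        hits.sort(key=lambda d: d[1])  # order domains along the sequence
    return domains
```

With hmmscan, the target columns describe the Pfam profile and the query columns describe the protein, so the envelope coordinates refer to positions in the GP-containing sequence.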

Step 3: Filter Inter-Domain

Focus on GP motifs that fall between annotated domains, as these are more likely to function in translational regulation.
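
One possible reading of "between annotated domains" is sketched below: the GP dipeptide must not overlap any domain and must have at least one domain on each side. This interpretation, the function name, and the 1-based inclusive coordinate convention are assumptions for illustration.

```python
from typing import Sequence, Tuple

Interval = Tuple[int, int]  # 1-based inclusive (start, end) of a domain


def is_inter_domain(gp_start: int, domains: Sequence[Interval]) -> bool:
    """Return True if the GP dipeptide lies in a linker between two domains.

    `gp_start` is the 1-based position of the G; the P sits at gp_start + 1.
    """
    gp_end = gp_start + 1
    overlaps = any(start <= gp_end and gp_start <= end for start, end in domains)
    has_upstream = any(end < gp_start for _, end in domains)
    has_downstream = any(start > gp_end for start, _ in domains)
    return not overlaps and has_upstream and has_downstream


if __name__ == "__main__":
    toy_domains = [(10, 100), (150, 250)]  # toy coordinates
    for position in (50, 120, 300):
        print(position, is_inter_domain(position, toy_domains))
```

Under this definition, GP motifs in N- or C-terminal tails are excluded; relaxing the flanking requirement would keep them.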

Step 4: Cluster by Context

Cluster sequences by:

Amino acid composition
Position-specific patterns
Upstream sequence similarity
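
To make the first of these criteria concrete, the sketch below clusters context windows by amino acid composition alone, using a simple greedy threshold scheme. The greedy algorithm, the Euclidean distance, and the 0.3 cutoff are stand-ins chosen for brevity; the pipeline's actual clustering method is not specified here.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition(seq: str) -> List[float]:
    """Fraction of each of the 20 standard amino acids in `seq`."""
    counts = Counter(seq.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS) or 1  # avoid division by zero
    return [counts[aa] / total for aa in AMINO_ACIDS]


def distance(a: List[float], b: List[float]) -> float:
    """Euclidean distance between two composition vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_cluster(
    contexts: Dict[str, str], max_distance: float = 0.3
) -> List[List[str]]:
    """Assign each context to the first cluster whose representative is close."""
    clusters: List[Tuple[List[float], List[str]]] = []
    for name, seq in contexts.items():
        vec = composition(seq)
        for representative, members in clusters:
            if distance(vec, representative) <= max_distance:
                members.append(name)
                break
        else:
            # No existing cluster is close enough: start a new one.
            clusters.append((vec, [name]))
    return [members for _, members in clusters]


if __name__ == "__main__":
    toy = {"a": "RRRRAAAA", "b": "RRRAAAAA", "c": "LLLLVVVV"}  # toy contexts
    print(greedy_cluster(toy))
```

Composition ignores residue order, which is why the position-specific and upstream-similarity criteria are needed alongside it.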

Databases Searched

Database	Approx. size	Description
UniProt Bacteria	~500k	Swiss-Prot bacterial proteins
UniProt Archaea	~50k	Swiss-Prot archaeal proteins
IMG/VR	Millions	Phage proteins
NCBI Viral RefSeq	~100MB	Viral reference sequences
INPHARED	~19k genomes	Curated phage genomes

Expected Output

Novel Peptide Families

Clustered sequences with similar context
Consensus motifs for each cluster
HMM profiles for validation searches
Taxonomic distribution analysis

Validation

Compare discovered peptides against:

Known stalling peptides (positive control)
Random GP occurrences (negative control)
Literature-reported ribosome profiling data
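
The two control comparisons can be summarized with simple rates, as in the hypothetical sketch below. The function name, the use of sequence IDs as set members, and the choice of metrics are assumptions for illustration only.

```python
from typing import Dict, Set


def control_rates(
    hits: Set[str], positives: Set[str], negatives: Set[str]
) -> Dict[str, float]:
    """Summarize how many control sequences appear among the pipeline hits."""
    recovered = hits & positives    # known stalling peptides that were found
    false_hits = hits & negatives   # random GP occurrences that were found
    return {
        "positive_recovery": len(recovered) / len(positives) if positives else 0.0,
        "negative_hit_rate": len(false_hits) / len(negatives) if negatives else 0.0,
    }


if __name__ == "__main__":
    # Toy identifiers, not real results.
    print(control_rates({"p1", "p2", "n1"}, {"p1", "p2", "p3", "p4"}, {"n1", "n2"}))
```

A high positive recovery combined with a low negative hit rate would suggest the discovered families are not simply tracking background GP frequency.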

Running the Pipeline

# Full prokaryotic discovery
pixi run snakemake prokaryotic_discovery --profile cluster/slurm

# Just the seed-based approach
pixi run snakemake search_comprehensive_hmm --profile cluster/slurm

# Just the GP analysis
pixi run snakemake extract_gp_motifs --profile cluster/slurm