flowchart TB
subgraph inputs[" "]
direction TB
Seeds["🧬 Seed Alignments<br/><small>Curated 2A sequences</small>"]
style Seeds fill:#e1f5fe
end
subgraph databases[" "]
direction TB
UniProt[(UniProt)]
RefProt[(RefProt)]
UniParc[(UniParc)]
MGnify[(MGnify)]
IMGVR[(IMG/VR)]
style UniProt fill:#fff3e0
style RefProt fill:#fff3e0
style UniParc fill:#fff3e0
style MGnify fill:#fff3e0
style IMGVR fill:#fff3e0
end
subgraph iteration1["Iteration 1"]
direction TB
HMM1["hmmbuild<br/><small>Seed HMMs</small>"]
Search1["hmmsearch<br/><small>All databases</small>"]
Filter1["Filter<br/><small>E-value, length</small>"]
style HMM1 fill:#e8f5e9
style Search1 fill:#e8f5e9
style Filter1 fill:#e8f5e9
end
subgraph iteration2["Iteration 2"]
direction TB
HMM2["hmmbuild<br/><small>Refined HMMs</small>"]
Search2["hmmsearch<br/><small>Expanded search</small>"]
Curate["Auto-curate<br/><small>Motif validation</small>"]
style HMM2 fill:#fce4ec
style Search2 fill:#fce4ec
style Curate fill:#fce4ec
end
subgraph outputs["Outputs"]
direction TB
FinalHMM["📊 Final Models<br/><small>Class 1 & 2 HMMs</small>"]
Logos["🎨 Sequence Logos"]
Report["📝 Analysis Report"]
style FinalHMM fill:#f3e5f5
style Logos fill:#f3e5f5
style Report fill:#f3e5f5
end
Seeds --> HMM1
HMM1 --> Search1
databases --> Search1
Search1 --> Filter1
Filter1 --> HMM2
HMM2 --> Search2
databases --> Search2
Search2 --> Curate
Curate --> FinalHMM
FinalHMM --> Logos
FinalHMM --> Report
2A Peptide Discovery
Systematic identification of viral 2A self-cleaving peptides and prokaryotic ribosome stalling sequences
Overview
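This project identifies 2A peptides across protein sequence databases using profile hidden Markov models (HMMs). 2A peptides are short (~15-20 residue), cis-acting oligopeptides that cause the ribosome to skip formation of one peptide bond at a conserved Gly-Pro (GP) dipeptide during translation. A second, prokaryotic pipeline searches bacterial, archaeal, viral, and phage sequences for ribosome stalling peptides.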
What are 2A Peptides?
2A peptides were first discovered in picornaviruses (foot-and-mouth disease virus) where they enable production of multiple proteins from a single open reading frame without requiring a stop codon. The ribosome “skips” peptide bond formation at the Gly-Pro junction, releasing the upstream protein while continuing translation of the downstream sequence.
- Conserved C-terminal motif: DxExNPGP
- Self-cleavage (ribosomal skipping) occurs between the Gly and the final Pro of NPGP
- No external protease required
- Widely used in biotechnology for multi-cistronic expression
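As a rough illustration of the motif and skip site listed above, the sketch below scans a protein sequence for DxExNPGP with a regular expression and splits it into the products that ribosomal skipping would release. It assumes `x` means any single residue, and the function names are hypothetical. The project itself uses profile HMMs, not a regex, so this is only a quick sanity check.

```python
import re

# D-x-E-x-N-P-G-P, with x assumed to be any single residue (uppercase letter).
MOTIF_2A = re.compile(r"D[A-Z]E[A-Z]NPGP")


def find_skip_sites(seq):
    """Return the 0-based index of the final Pro of each DxExNPGP hit.

    That Pro is the first residue of the downstream product.
    """
    seq = seq.upper()
    return [m.end() - 1 for m in MOTIF_2A.finditer(seq)]


def split_products(seq):
    """Split a polyprotein into the products released at each skip site."""
    seq = seq.upper()
    products, start = [], 0
    for site in find_skip_sites(seq):
        products.append(seq[start:site])  # upstream product ends in ...NPG
        start = site                      # downstream product starts with P
    products.append(seq[start:])
    return products


# T2A peptide followed by two arbitrary downstream residues (illustrative input).
print(find_skip_sites("EGRGSLLTCGDVEENPGPMK"))  # [17]
print(split_products("EGRGSLLTCGDVEENPGPMK"))   # ['EGRGSLLTCGDVEENPG', 'PMK']
```

A regex like this misses divergent motifs, which is the main reason to prefer profile HMMs for the actual search.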
Two Classes of 2A Peptides
We distinguish two classes based on conserved sequence features:
| Class | N-terminal | Central Motif | C-terminal |
|---|---|---|---|
| Class 1 | Leucine stretch | GDVE | NPGP |
| Class 2 | Invariant Trp | EEGIE | PNPGP/PHPGP |
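The table can be read as a pair of simple rules. The sketch below transcribes the central and C-terminal motifs into regular expressions and, for Class 2, requires a Trp somewhere upstream. The spacing allowed between features and the `classify_2a` name are assumptions made for illustration; the Class 1 leucine stretch is not checked because the table does not define its length. The project's classification comes from the Class 1 and Class 2 HMMs, not from these patterns.

```python
import re

# Hypothetical rules transcribed from the table; spacing between features is assumed.
CLASS_1 = re.compile(r"GDVE[A-Z]*NPGP$")               # central GDVE, C-terminal NPGP
CLASS_2 = re.compile(r"W[A-Z]*EEGIE[A-Z]*P[NH]PGP$")   # Trp, EEGIE, PNPGP or PHPGP


def classify_2a(peptide):
    """Return 'Class 1', 'Class 2', or None for a candidate peptide.

    The peptide is expected to end at the final Pro of the C-terminal motif.
    """
    pep = peptide.upper()
    if CLASS_2.search(pep):
        return "Class 2"
    if CLASS_1.search(pep):
        return "Class 1"
    return None


print(classify_2a("EGRGSLLTCGDVEENPGP"))  # T2A peptide -> Class 1
print(classify_2a("WAAEEGIEPHPGP"))       # synthetic test string -> Class 2
```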
Pipeline Overview
Databases Searched
Eukaryotic Pipeline
| Database | Description | Size |
|---|---|---|
| UniProt Swiss-Prot | Curated protein sequences | ~570k |
| Reference Proteomes | Representative proteomes | Medium |
| UniParc | Non-redundant archive | Very large |
| MGnify | Metagenomic proteins | ~300M |
Prokaryotic Pipeline
| Database | Description | Size |
|---|---|---|
| UniProt Bacteria | Swiss-Prot bacterial proteins | ~500k |
| UniProt Archaea | Swiss-Prot archaeal proteins | ~50k |
| NCBI Viral RefSeq | Viral protein sequences | ~100 MB |
| IMG/VR | Phage proteins | Millions |
| INPHARED | Curated phage genomes | ~19k genomes |
Key Analyses
Eukaryotic 2A Peptides
- Seed Search Results: Initial HMM search statistics
- Class Comparison: Differences between Class 1 and Class 2
- Sequence Logos: Visualization of conserved motifs
Prokaryotic Discovery
- Discovery Pipeline: Approach for finding novel stalling peptides
- GP Motif Analysis: Comprehensive GP context analysis
- Known Stalling Peptides: Validation against SecM, TnaC, etc.
Methods
- Pipeline Architecture: Snakemake workflow details
- HMM Approach: Profile HMM methodology
- Database Sources: Data provenance and URLs
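The "Filter" step of iteration 1 in the pipeline flowchart (E-value, length) could look roughly like the sketch below, which reads `hmmsearch --domtblout` output and keeps domain hits under an E-value cutoff whose aligned span falls in a length window. The thresholds, the choice of the independent E-value and aligned length as criteria, and the function names are all assumptions. The pipeline's actual rules live in the Snakemake workflow.

```python
def parse_domtblout(lines):
    """Yield one dict per domain hit from hmmsearch --domtblout output.

    The format has 22 whitespace-separated fields followed by a free-text
    description; comment lines start with '#'.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        f = line.split(maxsplit=22)
        yield {
            "target": f[0],
            "query": f[3],
            "i_evalue": float(f[12]),  # independent (per-domain) E-value
            "ali_from": int(f[17]),    # alignment start on the target
            "ali_to": int(f[18]),      # alignment end on the target
        }


def filter_hits(hits, max_evalue=1e-3, min_len=15, max_len=30):
    """Keep hits passing the (hypothetical) E-value and length thresholds."""
    for h in hits:
        ali_len = h["ali_to"] - h["ali_from"] + 1
        if h["i_evalue"] <= max_evalue and min_len <= ali_len <= max_len:
            yield h
```

Usage would be along the lines of `kept = list(filter_hits(parse_domtblout(open(path))))`, where `path` points at a domtblout file written by hmmsearch.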
Background
Ribosomal Stalling in Prokaryotes
Prokaryotes use related but distinct mechanisms for ribosomal stalling:
- SecM: Stalls at FXXXXWIXXXXGIRAGP to regulate expression of the secretion factor SecA
- TnaC: Tryptophan-dependent stalling that regulates the tryptophanase (tna) operon
- MifM: Regulates membrane protein insertion in B. subtilis
SecM's arrest motif ends in a GP dipeptide, as 2A peptides do, but these stalling peptides differ from 2A peptides, and from one another, in upstream context and biological function.
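The SecM motif listed above translates directly into a regular expression, and the same pass over a sequence can collect the residues upstream of every GP dipeptide for context analysis. This is a minimal sketch: the window size and function names are assumptions, and `X` is taken to mean any single residue.

```python
import re

# FXXXXWIXXXXGIRAGP, with X assumed to be any single residue.
SECM_ARREST = re.compile(r"F[A-Z]{4}WI[A-Z]{4}GIRAGP")


def gp_contexts(seq, upstream=15):
    """Yield (index_of_G, upstream_window) for every GP dipeptide in seq."""
    seq = seq.upper()
    for m in re.finditer(r"(?=GP)", seq):  # lookahead, so no GP is skipped
        i = m.start()
        yield i, seq[max(0, i - upstream):i]


# E. coli SecM arrest sequence with arbitrary flanking residues (illustrative).
seq = "MKFSTPVWISQAQGIRAGPQR"
print(SECM_ARREST.search(seq).start())    # 2
print(list(gp_contexts(seq, upstream=6)))  # [(17, 'AQGIRA')]
```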
Biotechnology Applications
2A peptides are widely used for:
- Multi-gene expression from single vectors
- Stoichiometric co-expression of protein complexes
- CAR-T cell engineering
- Vaccine development
Resources
- Source Code: GitHub repository
- Seed Alignments: `resources/seed-alignments/`
- Final Models: `results/models/final/`
# Full pipeline
pixi run snakemake --cores 12
# Prokaryotic discovery only
pixi run snakemake prokaryotic_discovery --cores 12