HMM Methodology – 2A Peptide Discovery

Profile HMMs for 2A Peptides

Profile hidden Markov models (pHMMs) are probabilistic models that capture position-specific amino acid preferences and insertion/deletion patterns from multiple sequence alignments.

Why Profile HMMs?

2A peptides are challenging to identify using standard BLAST:

Short sequences (~15-20 residues) yield low alignment scores, so pairwise hits rarely reach statistical significance
Conserved motifs are interspersed with variable regions
Distant homologs may share <30% sequence identity

Profile HMMs address these challenges by:

Modeling position-specific conservation patterns
Allowing insertions and deletions with learned penalties
Providing calibrated E-values for statistical significance
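
To make "position-specific conservation" concrete, the sketch below computes per-column residue frequencies from a small alignment. The alignment is a toy example invented for illustration, and `column_frequencies` is a hypothetical helper. This is only the raw counting step: `hmmbuild` additionally applies sequence weighting and priors, and estimates transition probabilities.

```python
from collections import Counter

# Toy alignment invented for illustration; '-' marks a gap.
TOY_ALIGNMENT = [
    "DVEENPGP",
    "DIESNPGP",
    "DVE-NPGP",
    "DVEQNPGP",
]

def column_frequencies(alignment):
    """Return one {residue: frequency} dict per alignment column, ignoring gaps."""
    profiles = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa != "-")
        total = sum(counts.values())
        if total == 0:
            # Column made only of gaps: no residue information.
            profiles.append({})
            continue
        profiles.append({aa: n / total for aa, n in counts.items()})
    return profiles

profile = column_frequencies(TOY_ALIGNMENT)
for position, freqs in enumerate(profile, start=1):
    print(position, freqs)
```

Columns where one residue dominates (here the trailing NPGP) contribute most to a profile score, while variable columns contribute little.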

HMMER Suite

We use HMMER for all HMM operations:

Tool	Purpose
`hmmbuild`	Build profile HMM from alignment
`hmmsearch`	Search database with HMM
`hmmalign`	Align sequences to HMM
`hmmpress`	Compress and index an HMM database for `hmmscan`

Building Models

# Build HMM from Stockholm alignment
hmmbuild -n "2A-class-1" model.hmm alignment.sto

# The resulting model encodes:
# - Position-specific emission probabilities
# - Transition probabilities into and out of insert and delete states
# - A match-state consensus sequence

Search Strategy

# Search a protein database with the model
# --tblout: per-sequence hit table; -A: Stockholm alignment of significant hits
hmmsearch --cpu 12 \
    --tblout results.tblout \
    -A results.sto \
    model.hmm database.fasta
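
The `--tblout` file is a whitespace-delimited table with `#` comment lines, in which the target name, query name, full-sequence E-value and full-sequence bit score are the 1st, 3rd, 5th and 6th fields. A minimal parsing sketch follows; `Hit`, `parse_tblout` and the sample rows are hypothetical and not part of the pipeline.

```python
from typing import NamedTuple

class Hit(NamedTuple):
    target: str
    query: str
    evalue: float   # full-sequence E-value
    score: float    # full-sequence bit score

def parse_tblout(lines):
    """Parse hmmsearch --tblout lines into Hit records, skipping comments."""
    hits = []
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split()
        hits.append(Hit(fields[0], fields[2], float(fields[4]), float(fields[5])))
    return hits

# Made-up rows in the tblout column layout, for illustration only.
SAMPLE = """\
# target name  accession  query name  accession  E-value  score  bias  ...
toy_seq_1  -  2A-class-1  -  1.2e-12  45.3  0.1  1.5e-12  45.0  0.1  1.0  1  0  0  1  1  1  1  hypothetical protein
toy_seq_2  -  2A-class-1  -  3.0e-04  12.1  0.0  4.1e-04  11.8  0.0  1.0  1  0  0  1  1  1  0  -
"""

hits = parse_tblout(SAMPLE.splitlines())
for hit in hits:
    print(hit.target, hit.evalue, hit.score)
```

For real output, pass an open file handle for `results.tblout` instead of `SAMPLE.splitlines()`.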

Iterative Refinement

The pipeline uses iterative HMM refinement:

flowchart LR
    A[Seed Alignment] --> B[Build HMM]
    B --> C[Search DBs]
    C --> D[Filter Hits]
    D --> E[Merge Alignments]
    E --> F[Build Refined HMM]
    F --> C

Figure 1: Iterative HMM refinement cycle
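
The cycle in Figure 1 can be sketched as a loop that chains file names from one round to the next. This is a simplified illustration rather than the pipeline's actual driver: `refinement_commands` and the file naming are hypothetical, only the `hmmbuild` and `hmmsearch` flags shown earlier are used, and the "Filter Hits" and "Merge Alignments" steps are left out, so each round simply rebuilds from the previous round's hit alignment.

```python
def refinement_commands(round_no, alignment, database, cpus=12):
    """Return (build_cmd, search_cmd, hits_alignment) for one refinement round."""
    model = f"round{round_no}.hmm"
    tblout = f"round{round_no}.tblout"
    hits_sto = f"round{round_no}.hits.sto"
    build = ["hmmbuild", "-n", f"2A-round{round_no}", model, alignment]
    search = [
        "hmmsearch", "--cpu", str(cpus),
        "--tblout", tblout,
        "-A", hits_sto,
        model, database,
    ]
    return build, search, hits_sto

# Plan two rounds; the hit alignment of round N seeds round N + 1.
alignment = "seed.sto"
plan = []
for round_no in (1, 2):
    build, search, alignment = refinement_commands(round_no, alignment, "database.fasta")
    plan.append((build, search))

for build, search in plan:
    print(" ".join(build))
    print(" ".join(search))
```

With HMMER installed, each command list could be passed to `subprocess.run(cmd, check=True)` in order.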

Iteration 1: Seed Search

Start with curated seed alignments (known 2A peptides)
Build initial HMMs
Search all databases
Filter by E-value threshold (1e-5)

Auto-Curation

After iteration 2, automated curation applies the following filters:

E-value filter: Remove low-confidence hits
Length filter: Require minimum 15 residues
Gap filter: Maximum 50% gaps
Motif filter: Require C-terminal PGP motif
Outlier removal: Drop bottom 10% by bit score
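
The filters above could be combined roughly as follows. This is a sketch under several assumptions that the text does not specify: the E-value cutoff reuses the 1e-5 threshold from iteration 1, the gap fraction is measured over the aligned row, "C-terminal PGP" is read as the ungapped sequence ending in PGP, and the per-sequence filters run before outlier removal. `Candidate`, `passes_filters`, `drop_low_scorers` and `curate` are hypothetical names.

```python
import re
from typing import NamedTuple

class Candidate(NamedTuple):
    name: str
    aligned: str    # aligned row; '-' and '.' are gaps
    evalue: float
    score: float    # bit score

PGP_AT_END = re.compile(r"PGP$")

def passes_filters(c, max_evalue=1e-5, min_len=15, max_gap_frac=0.5):
    """Apply the E-value, length, gap and motif filters to one candidate."""
    ungapped = c.aligned.replace("-", "").replace(".", "").upper()
    gap_frac = 1 - len(ungapped) / len(c.aligned) if c.aligned else 1.0
    return (
        c.evalue <= max_evalue
        and len(ungapped) >= min_len
        and gap_frac <= max_gap_frac
        and PGP_AT_END.search(ungapped) is not None
    )

def drop_low_scorers(candidates, fraction=0.10):
    """Remove the lowest-scoring fraction (rounded down) by bit score."""
    n_drop = int(len(candidates) * fraction)
    return sorted(candidates, key=lambda c: c.score)[n_drop:]

def curate(candidates):
    return drop_low_scorers([c for c in candidates if passes_filters(c)])

# Placeholder sequence (20 residues), not a real 2A peptide.
good = Candidate("ok", "AAAAAAAAAAAAGDVENPGP", 1e-8, 30.0)
pool = [good._replace(name=f"s{i}", score=float(i)) for i in range(1, 11)]
kept = curate(pool)
print(len(kept), kept[0].name)
```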

Model Characteristics

Class 1 HMM Profile

Key conserved positions:

N-terminal: L-rich region
Central: GDVE motif
C-terminal: NPGP

Class 2 HMM Profile

Key conserved positions:

N-terminal: Invariant W
Central: EEGIE motif
C-terminal: PNPGP/PHPGP
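
As a quick sanity check on curated hits, the listed motifs can be expressed as regular expressions. This is only an illustrative sketch: exact string matching is far stricter than the HMMs themselves, the N-terminal-to-C-terminal ordering is taken from the lists above, the L-rich region is not modeled, and `CLASS_SIGNATURES` and `matching_classes` are hypothetical names. The test sequences are placeholders.

```python
import re

CLASS_SIGNATURES = {
    # Central GDVE followed by a C-terminal NPGP
    "class 1": re.compile(r"GDVE.*NPGP$"),
    # W, then EEGIE, then a C-terminal PNPGP or PHPGP
    "class 2": re.compile(r"W.*EEGIE.*P[NH]PGP$"),
}

def matching_classes(sequence):
    """Return the names of all class signatures found in the sequence."""
    seq = sequence.upper()
    return [name for name, pattern in CLASS_SIGNATURES.items() if pattern.search(seq)]

print(matching_classes("LLLLGDVEANPGP"))
print(matching_classes("WAAEEGIEAAPHPGP"))
print(matching_classes("AAAAAAAAAAAAAAA"))
```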

E-value Interpretation

E-value	Interpretation
< 1e-10	Very confident hit
1e-10 - 1e-5	Confident hit
1e-5 - 1e-3	Possible hit, review
> 1e-3	Likely spurious
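
The table maps directly onto a small lookup function. In this sketch `interpret_evalue` is a hypothetical helper, and the handling of values that fall exactly on a boundary (1e-10, 1e-5, 1e-3) is an assumption, since the table leaves it open.

```python
def interpret_evalue(evalue):
    """Map an E-value to the interpretation categories in the table above."""
    if evalue < 1e-10:
        return "Very confident hit"
    if evalue <= 1e-5:   # boundary assigned to the more confident category (assumption)
        return "Confident hit"
    if evalue <= 1e-3:
        return "Possible hit, review"
    return "Likely spurious"

for value in (1e-12, 1e-7, 5e-4, 0.5):
    print(value, interpret_evalue(value))
```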

Cross-Class Matches

Because both classes share the conserved C-terminal PGP motif, some sequences partially match both the class 1 and class 2 models. Compare the results of both searches before assigning a hit to a class.
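
One way to compare the two result sets is to look at the bit scores a sequence receives from each model and flag close calls for manual review. The sketch below assumes each search has been reduced to a mapping from target name to bit score; `assign_classes` and the 5-bit margin are hypothetical choices, not values from the pipeline.

```python
def assign_classes(class1_scores, class2_scores, min_margin=5.0):
    """Label each target as 'class 1', 'class 2', or 'ambiguous' by bit score."""
    calls = {}
    for target in sorted(set(class1_scores) | set(class2_scores)):
        s1 = class1_scores.get(target)
        s2 = class2_scores.get(target)
        if s2 is None:
            calls[target] = "class 1"
        elif s1 is None:
            calls[target] = "class 2"
        elif abs(s1 - s2) < min_margin:
            # Both models score similarly: needs manual review.
            calls[target] = "ambiguous"
        else:
            calls[target] = "class 1" if s1 > s2 else "class 2"
    return calls

# Made-up scores for illustration.
calls = assign_classes(
    {"seq_a": 40.0, "seq_b": 22.0, "seq_c": 18.0},
    {"seq_b": 20.5, "seq_c": 35.0, "seq_d": 28.0},
)
print(calls)
```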
