Seed Search Results

Initial HMM search statistics for eukaryotic 2A peptides

Published

January 18, 2026

Overview

This page summarizes the initial HMM search results using seed models built from curated 2A peptide alignments.

Seed Alignments

The seed alignments contain known, experimentally validated 2A peptides:

Class	Peptides	Source
Class 1	T2A, E2A, P2A	Picornaviruses (E2A, P2A); Thosea asigna virus (T2A)
Class 2	F2A variants	Aphthoviruses

Search Statistics

Table 1: HMM search statistics by database

library(tidyverse)

# Directory the search statistics will be loaded from once they are available
# (not read yet; the table below is built by hand)
results_dir <- Sys.getenv("RESULTS_DIR", "../../scratch/results")

# Example statistics table (placeholder values, not real search results)
stats <- tibble(
    Database = c("UniProt", "RefProt", "UniParc", "MGnify"),
    `Class 1 Hits` = c(45, 128, 892, 3421),
    `Class 2 Hits` = c(23, 67, 445, 1893),
    `E-value < 1e-10` = c(68, 195, 1337, 5314)
)

stats

Hit Distribution

By Database

Placeholder

Actual search results will be displayed here once the pipeline has been run.
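
Until real results exist, the per-database view could be prototyped against the example counts from Table 1. The sketch below is only one possible layout: the counts are the placeholder values from the example table, and the names `hits_long` and `p` are introduced here for illustration.

```r
library(tibble)
library(tidyr)
library(ggplot2)

# Placeholder counts copied from the example statistics table; not real results.
stats <- tibble(
    Database = c("UniProt", "RefProt", "UniParc", "MGnify"),
    `Class 1 Hits` = c(45, 128, 892, 3421),
    `Class 2 Hits` = c(23, 67, 445, 1893)
)

# Reshape to one row per database/class pair so ggplot can group by class.
hits_long <- pivot_longer(
    stats,
    cols = c(`Class 1 Hits`, `Class 2 Hits`),
    names_to = "Class",
    values_to = "Hits"
)

# Side-by-side bars for the two classes within each database.
p <- ggplot(hits_long, aes(x = Database, y = Hits, fill = Class)) +
    geom_col(position = "dodge") +
    labs(x = "Database", y = "Hits") +
    theme_minimal()

p
```

Because the counts span two orders of magnitude, a log-scaled axis or one facet per database may turn out to be easier to read than the linear axis used here.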

By E-value

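library(ggplot2)
library(tibble)

# Example E-value distribution (simulated values, not search output)
set.seed(1)  # make the simulated draw reproducible
evalues <- rexp(1000, rate = 1e8)
df <- tibble(evalue = evalues)

# Histogram on a log10 x-axis, since E-values span many orders of magnitude
ggplot(df, aes(x = evalue)) +
    geom_histogram(bins = 50) +
    scale_x_log10() +
    labs(x = "E-value", y = "Count") +
    theme_minimal()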

Quality Metrics

Alignment Length Distribution

Expected alignment lengths:

- Class 1: 18-22 residues
- Class 2: 20-24 residues
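
A check against these ranges might look like the following minimal sketch. The helper `length_in_range` and the example hits are hypothetical; only the two ranges come from this page, and the bounds are assumed to be inclusive.

```r
# Expected alignment length ranges (residues), treated as inclusive bounds.
expected_lengths <- list(
    "Class 1" = c(18, 22),
    "Class 2" = c(20, 24)
)

# Return TRUE where a hit's alignment length falls inside its class range.
# `class` is a character vector of class labels, `aln_length` the matching lengths.
length_in_range <- function(class, aln_length) {
    bounds <- expected_lengths[class]
    lower <- vapply(bounds, function(b) b[1], numeric(1))
    upper <- vapply(bounds, function(b) b[2], numeric(1))
    unname(aln_length >= lower & aln_length <= upper)
}

# Hypothetical hits, for illustration only.
length_in_range(c("Class 1", "Class 1", "Class 2"), c(19, 23, 24))
```

Hits flagged `FALSE` are not necessarily wrong; they are candidates for manual review before the length filter is applied.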

Conservation Scores

Key conserved positions in seed alignments:

- Position -1 (G): >95% conserved
- Position 0 (P): 100% conserved
- Position -4 (N/P): >90% conserved
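
One simple way to arrive at per-position percentages like these is to take, for each alignment column, the fraction of sequences that carry the most common residue. The sketch below assumes that definition, which may differ from the scoring actually used for the seed alignments. The function `column_conservation` and the toy alignment are invented for illustration.

```r
# Fraction of sequences sharing the most common residue in each column.
# `aligned` is a character vector of equal-length aligned sequences.
# Gaps ("-") count toward the denominator but never as the consensus residue.
column_conservation <- function(aligned) {
    stopifnot(length(unique(nchar(aligned))) == 1)
    # Rows are sequences, columns are alignment positions.
    mat <- do.call(rbind, strsplit(aligned, ""))
    apply(mat, 2, function(col) {
        residues <- col[col != "-"]
        if (length(residues) == 0) return(0)
        max(table(residues)) / length(col)
    })
}

# Toy alignment invented for illustration; these are not seed sequences.
toy <- c("ENPGP", "ENPGP", "SNPGP", "E-PGP")
column_conservation(toy)
```

Columns are returned in alignment order, so mapping them to motif-relative positions such as 0 or -1 requires knowing which column holds the reference residue.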

Filtering Summary

Filter	Sequences Removed
E-value > 1e-5	TBD
Length < 15	TBD
Gap % > 50%	TBD
Missing PGP motif	TBD
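
The four filters could be applied in sequence while counting how many hits each one removes, which would fill in the TBD column above. This is a sketch under several assumptions: the column names, the toy sequences, the order of the filters, and the choice to measure length on the ungapped sequence are all hypothetical.

```r
library(dplyr)
library(tibble)

# Toy hits invented for illustration; sequences are artificial, not real 2A peptides.
hits <- tibble(
    id = paste0("h", 1:5),
    evalue = c(1e-12, 1e-3, 1e-8, 1e-20, 1e-9),
    sequence = c(
        paste0(strrep("A", 14), "NPGP"),
        paste0(strrep("A", 14), "NPGP"),
        paste0(strrep("A", 8), "NPGP"),
        paste0(strrep("-", 20), strrep("A", 12), "NPGP"),
        paste0(strrep("A", 14), "NPGA")
    )
) %>%
    mutate(
        ungapped = gsub("-", "", sequence, fixed = TRUE),
        aln_length = nchar(ungapped),
        gap_fraction = 1 - aln_length / nchar(sequence)
    )

# Each filter returns TRUE for the rows to keep.
filters <- list(
    "E-value > 1e-5" = function(d) d$evalue <= 1e-5,
    "Length < 15" = function(d) d$aln_length >= 15,
    "Gap % > 50%" = function(d) d$gap_fraction <= 0.5,
    "Missing PGP motif" = function(d) grepl("PGP", d$ungapped, fixed = TRUE)
)

# Apply the filters in order, recording how many rows each one removes.
removed <- integer(0)
kept <- hits
for (name in names(filters)) {
    keep <- filters[[name]](kept)
    removed[name] <- sum(!keep)
    kept <- kept[keep, ]
}

summary_tbl <- tibble(Filter = names(removed), `Sequences Removed` = unname(removed))
summary_tbl
```

Because the filters run sequentially, each count reflects only the hits that survived the earlier filters, so the numbers depend on the order chosen.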

Next Steps

- Review filtered alignments
- Build refined models from high-confidence hits
- Perform iteration 2 searches
- Generate sequence logos