flowchart LR
  subgraph Protein
    D1[Domain A] --> L[Linker + GP] --> D2[Domain B]
  end
  subgraph Types
    T1[Inter-domain GP]
    T2[Intra-domain GP]
    T3[Terminal GP]
  end
GP Motif Analysis
Comprehensive analysis of Gly-Pro motifs in prokaryotic proteins
Overview
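The Gly-Pro (GP) dipeptide sits at the C-terminal end of well-characterized ribosomal arrest motifs, such as the RAGP tail of E. coli SecM, and is used here as the anchor for candidate stalling sites. This analysis examines GP motif distribution, sequence context, and conservation across prokaryotic proteomes.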
GP Motif Extraction
Parameters
prokaryotic:
  gp_upstream: 30    # Residues upstream of GP
  gp_downstream: 15  # Residues downstream of GP
Expected Distribution
GP motifs occur throughout proteins, but functional stalling sites have distinctive features:
| Feature | Random GP | Stalling GP |
|---|---|---|
| Position | Any | Inter-domain |
| Upstream | Variable | Conserved pattern |
| Downstream | Variable | Often P or charged |
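The extraction step configured under Parameters (30 residues upstream, 15 downstream) can be sketched in base R as follows. The function name `extract_gp_contexts`, the choice to clip windows at the sequence ends, and the `complete` flag are assumptions for illustration, not the pipeline's actual implementation.

```r
# Extract the sequence context around every GP in one protein sequence.
# Hypothetical helper: window sizes mirror gp_upstream / gp_downstream.
extract_gp_contexts <- function(seq, upstream = 30, downstream = 15) {
  hits <- gregexpr("GP", seq, fixed = TRUE)[[1]]
  if (hits[1] == -1) {
    # No GP in this sequence: return an empty table with the same columns.
    return(data.frame(
      gp_pos = integer(0), context = character(0), complete = logical(0),
      stringsAsFactors = FALSE
    ))
  }
  gp_pos <- as.integer(hits)  # 1-based position of the G in each GP
  # Clip the window at the sequence ends instead of padding.
  start <- pmax(1L, gp_pos - upstream)
  end <- pmin(nchar(seq), gp_pos + 1L + downstream)
  data.frame(
    gp_pos = gp_pos,
    context = substring(seq, start, end),
    # TRUE when the full upstream and downstream window fits in the protein.
    complete = (gp_pos - upstream >= 1L) & (gp_pos + 1L + downstream <= nchar(seq)),
    stringsAsFactors = FALSE
  )
}

# Toy sequence: 40 Ala, then GP, then 20 Lys.
toy_seq <- paste0(strrep("A", 40), "GP", strrep("K", 20))
extract_gp_contexts(toy_seq)
```

Under this convention a complete window spans 47 residues (30 + 2 + 15). Whether the pipeline counts the GP itself inside the window is not stated here, so that length is an assumption of the sketch.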
Context Analysis
Upstream Patterns
Functional stalling peptides show conserved patterns upstream of GP:
-30 -20 -10 GP
| | | ||
L---F--W------R/K--AGP (SecM-like)
W------E------G----GP (TnaC-like)
Position-Specific Conservation
| Position | SecM-like | TnaC-like | Random |
|---|---|---|---|
| -1 (A/G) | High | Medium | Low |
| -2 (R/K) | High | Low | Low |
| -3 | Variable | Variable | Variable |
| -10 (W) | High | High | Low |
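As a rough illustration of how the conserved positions in this table could be checked for a single context, the sketch below tests positions -1, -2, and -10 relative to the G of GP. The helper `score_upstream` and the idea of counting matches with equal weight are assumptions; the example string is the commonly cited E. coli SecM arrest segment (residues 150 to 166).

```r
# Check a context against the conserved upstream positions from the table.
# Offsets are relative to the G of GP (-1 is the residue just before G).
# Hypothetical helper; equal weighting of positions is an assumption.
score_upstream <- function(context, gp_index) {
  residue_at <- function(offset) {
    i <- gp_index + offset
    if (i < 1) NA_character_ else substr(context, i, i)
  }
  c(
    pos_minus1  = residue_at(-1)  %in% c("A", "G"),
    pos_minus2  = residue_at(-2)  %in% c("R", "K"),
    pos_minus10 = residue_at(-10) %in% "W"
  )
}

secm_like <- "FSTPVWISQAQGIRAGP"   # G of the terminal GP is at index 16
checks <- score_upstream(secm_like, gp_index = 16)
checks
sum(checks)  # number of conserved positions matched
```

Positions that fall before the start of the context count as mismatches, so truncated contexts score lower by construction.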
Domain Context
Inter-Domain vs Intra-Domain
GP motifs are classified by their domain context:
- Inter-domain: Between two annotated Pfam domains
- Intra-domain: Within a single domain
- Terminal: Near N- or C-terminus
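The three-way classification above can be sketched as a simple interval check. The function `classify_gp`, the column names `start` and `end`, and the order of the rules (intra-domain first, then inter-domain, everything else terminal) are assumptions; in particular, no distance cutoff for "near" a terminus is given, so any GP not flanked by domains on both sides is labelled terminal here.

```r
# Classify GP positions by domain context.
# domains: data.frame with 1-based, inclusive `start` and `end` columns
# (hypothetical layout for parsed Pfam hits).
classify_gp <- function(gp_pos, domains) {
  classify_one <- function(pos) {
    if (any(pos >= domains$start & pos <= domains$end)) {
      return("intra-domain")
    }
    has_left <- any(domains$end < pos)     # a domain ends before the GP
    has_right <- any(domains$start > pos)  # a domain starts after the GP
    if (has_left && has_right) "inter-domain" else "terminal"
  }
  vapply(gp_pos, classify_one, character(1))
}

toy_domains <- data.frame(start = c(10, 100), end = c(60, 180))
classify_gp(c(5, 30, 80, 200), toy_domains)
```

A protein with no annotated domains yields "terminal" for every GP under these rules, which may or may not match how the pipeline treats unannotated proteins.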
Enrichment Analysis
Functional stalling GPs are enriched in inter-domain regions:
| Context | All GP | Stalling GP | Enrichment |
|---|---|---|---|
| Inter-domain | 15% | 60% | 4x |
| Intra-domain | 70% | 25% | 0.4x |
| Terminal | 15% | 15% | 1x |
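The enrichment column is the fraction of stalling GPs in a context divided by the fraction of all GPs in that context. A minimal check of the table's arithmetic, using only the percentages shown above:

```r
# Reproduce the enrichment column from the two percentage columns.
enrichment_tbl <- data.frame(
  context = c("Inter-domain", "Intra-domain", "Terminal"),
  all_gp = c(0.15, 0.70, 0.15),
  stalling_gp = c(0.60, 0.25, 0.15)
)
enrichment_tbl$enrichment <- round(enrichment_tbl$stalling_gp / enrichment_tbl$all_gp, 1)
enrichment_tbl
```

With raw counts rather than percentages, base R's `fisher.test` could be applied per context to attach a significance estimate; the counts are not given here, so that step is left out.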
Clustering Results
GP motif sequences were clustered using MMseqs2 at 70% sequence identity to identify conserved families.
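One way such a clustering run could be launched from R is sketched below, using the MMseqs2 `easy-cluster` workflow with `--min-seq-id 0.7`. The input and output names are placeholders, and all other MMseqs2 settings (coverage, sensitivity) are left at their defaults because they are not specified here; this is not necessarily the command the pipeline used.

```r
# Build an MMseqs2 easy-cluster call at 70% sequence identity.
# File names are hypothetical placeholders.
input_fasta <- "gp_contexts.fasta"
out_prefix <- "gp_clusters"
tmp_dir <- file.path(tempdir(), "mmseqs_tmp")

mmseqs_args <- c(
  "easy-cluster", input_fasta, out_prefix, tmp_dir,
  "--min-seq-id", "0.7"
)

# Run only when the binary and the input are actually available.
if (nzchar(Sys.which("mmseqs")) && file.exists(input_fasta)) {
  status <- system2("mmseqs", mmseqs_args)
  # easy-cluster writes <prefix>_cluster.tsv (representative, member pairs).
} else {
  message("Not run; command would be: mmseqs ", paste(mmseqs_args, collapse = " "))
}
```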
library(tidyverse)
library(scales)
cluster_file <- here::here("results/prokaryotic/gp_analysis/bacteria/gp_clusters.tsv.gz")
if (file.exists(cluster_file)) {
  clusters_df <- read_tsv(cluster_file, show_col_types = FALSE)
  data_available <- TRUE
} else {
  data_available <- FALSE
  message("Cluster data not found at ", cluster_file)
}
Summary Statistics
if (data_available) {
  n_total_motifs <- nrow(clusters_df)
  n_unique_seqs <- n_distinct(clusters_df$context_sequence)
  n_clusters <- n_distinct(clusters_df$cluster_id)
  duplication_ratio <- n_total_motifs / n_unique_seqs
  cluster_stats <- clusters_df |>
    group_by(cluster_id) |>
    summarise(
      total_motifs = n(),
      unique_seqs = n_distinct(context_sequence),
      .groups = "drop"
    )
  n_singletons <- sum(cluster_stats$unique_seqs == 1)
  pct_singletons <- 100 * n_singletons / n_clusters
  tibble(
    Metric = c(
      "Total GP motifs",
      "Unique sequences",
      "Total clusters",
      "Singleton clusters",
      "Duplication ratio",
      "Largest cluster (unique)",
      "Largest cluster (total)"
    ),
    Value = c(
      comma(n_total_motifs),
      comma(n_unique_seqs),
      comma(n_clusters),
      paste0(comma(n_singletons), " (", round(pct_singletons, 1), "%)"),
      paste0(round(duplication_ratio, 2), "x"),
      comma(max(cluster_stats$unique_seqs)),
      comma(max(cluster_stats$total_motifs))
    )
  ) |>
    knitr::kable()
}
Cluster Size Distribution
if (data_available) {
  plot_data <- cluster_stats |>
    pivot_longer(
      cols = c(unique_seqs, total_motifs),
      names_to = "count_type",
      values_to = "count"
    ) |>
    mutate(count_type = recode(count_type,
      unique_seqs = "Unique Sequences",
      total_motifs = "Total Motifs"
    ))
  ggplot(plot_data, aes(x = count, fill = count_type)) +
    geom_histogram(bins = 50, alpha = 0.7) +
    scale_x_log10(labels = comma) +
    scale_y_log10(labels = comma) +
    facet_wrap(~count_type, ncol = 2) +
    labs(
      x = "Cluster Size",
      y = "Number of Clusters (log scale)"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
}
Top Clusters
if (data_available) {
  cluster_stats |>
    slice_max(unique_seqs, n = 15) |>
    mutate(duplication = round(total_motifs / unique_seqs, 2)) |>
    rename(
      `Cluster ID` = cluster_id,
      `Unique Seqs` = unique_seqs,
      `Total Motifs` = total_motifs,
      Duplication = duplication
    ) |>
    knitr::kable()
}
Duplication Analysis
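Comparing unique sequences to total motifs per cluster shows how often identical context sequences recur within a cluster. Clusters on the dashed 1:1 line contain no duplicates; the further a cluster sits above the line, the more of its motifs are exact repeats.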
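To make the per-cluster duplication ratio concrete, the sketch below computes it on a small toy table with the same two columns the analysis relies on (`cluster_id`, `context_sequence`). The sequences are invented for illustration and base R is used so the example runs without the cluster file.

```r
# Toy cluster table (invented values, not pipeline output).
toy <- data.frame(
  cluster_id = c("c1", "c1", "c1", "c2", "c2", "c3"),
  context_sequence = c("AAGP", "AAGP", "ARGP", "KKGP", "KRGP", "WWGP"),
  stringsAsFactors = FALSE
)

# Count all motifs and distinct sequences per cluster.
total_motifs <- tapply(toy$context_sequence, toy$cluster_id, length)
unique_seqs <- tapply(toy$context_sequence, toy$cluster_id,
                      function(x) length(unique(x)))

toy_stats <- data.frame(
  cluster_id = names(total_motifs),
  total_motifs = as.integer(total_motifs),
  unique_seqs = as.integer(unique_seqs),
  duplication = as.numeric(total_motifs / unique_seqs)
)
toy_stats
```

Cluster `c1` holds three motifs but only two distinct sequences, so its ratio is 1.5; the other two clusters sit on the 1:1 line.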
if (data_available) {
  multi_clusters <- cluster_stats |>
    filter(unique_seqs > 1)
  ggplot(multi_clusters, aes(x = unique_seqs, y = total_motifs)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray50") +
    geom_point(alpha = 0.3, size = 1) +
    scale_x_log10(labels = comma) +
    scale_y_log10(labels = comma) +
    labs(
      x = "Unique Sequences per Cluster",
      y = "Total Motifs per Cluster"
    ) +
    theme_minimal()
}
Cluster Size Breakdown
if (data_available) {
  cluster_stats |>
    mutate(size_category = case_when(
      unique_seqs == 1 ~ "1 (singleton)",
      unique_seqs <= 5 ~ "2-5",
      unique_seqs <= 10 ~ "6-10",
      unique_seqs <= 50 ~ "11-50",
      unique_seqs <= 100 ~ "51-100",
      TRUE ~ ">100"
    )) |>
    mutate(size_category = factor(size_category,
      levels = c("1 (singleton)", "2-5", "6-10", "11-50", "51-100", ">100")
    )) |>
    group_by(size_category) |>
    summarise(
      Clusters = n(),
      `Unique Seqs` = sum(unique_seqs),
      `Total Motifs` = sum(total_motifs),
      .groups = "drop"
    ) |>
    mutate(`% Clusters` = round(100 * Clusters / n_clusters, 1)) |>
    rename(`Size Category` = size_category) |>
    knitr::kable()
}
Downstream Analysis
Codon Usage
Stalling peptides may show:
- Rare codon usage upstream of GP
- A specific codon at the Pro (P) position
- mRNA secondary structure around the motif
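The first point, rare codons upstream of GP, could be screened with a sketch like the one below. The helpers `split_codons` and `flag_rare_upstream`, the 10-codon look-back, the threshold, and the frequency values are all placeholders for illustration; real frequencies would come from the organism's own codon usage table.

```r
# Split a coding sequence into codons (trailing partial codon is dropped).
split_codons <- function(cds) {
  n <- nchar(cds) %/% 3
  if (n == 0) return(character(0))
  starts <- seq(1, by = 3, length.out = n)
  substring(cds, starts, starts + 2)
}

# Flag codons upstream of the Gly codon of GP whose frequency is below a threshold.
# codon_freq: named numeric vector (hypothetical usage table).
flag_rare_upstream <- function(cds, gp_codon_index, codon_freq,
                               threshold, n_upstream = 10) {
  codons <- split_codons(cds)
  idx <- seq_len(gp_codon_index - 1)
  idx <- idx[idx >= gp_codon_index - n_upstream]
  upstream <- codons[idx]
  data.frame(
    codon_index = idx,
    codon = upstream,
    rare = unname(codon_freq[upstream] < threshold),
    stringsAsFactors = FALSE
  )
}

# Toy CDS: ATG AGG CTG GGT CCG (Gly-Pro at codons 4-5).
toy_cds <- "ATGAGGCTGGGTCCG"
# Placeholder per-thousand frequencies, not measured values.
toy_freq <- c(ATG = 25, AGG = 2, CTG = 50, GGT = 24, CCG = 23)
rare_tbl <- flag_rare_upstream(toy_cds, gp_codon_index = 4,
                               codon_freq = toy_freq, threshold = 5)
rare_tbl
```

Codons missing from the usage table come back as `NA` rather than being silently treated as common.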
Ribosome Profiling Validation
Compare GP positions to ribosome profiling data:
- Ribosome occupancy peaks
- Pause site mapping
- Translation rate analysis
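A common way to relate a GP position to ribosome occupancy is a pause score: footprint counts at a codon divided by the mean count across the gene. The sketch below uses that convention with invented counts; the function name `pause_scores` and the normalization choice are assumptions, and published analyses differ in how they normalize and threshold.

```r
# Pause score per codon: occupancy divided by the gene-wide mean occupancy.
pause_scores <- function(counts) {
  mean_occ <- mean(counts)
  if (mean_occ == 0) {
    return(rep(NA_real_, length(counts)))  # no coverage, score undefined
  }
  counts / mean_occ
}

# Toy footprint counts per codon (invented, not real profiling data).
toy_counts <- c(2, 3, 1, 2, 40, 2, 3, 1, 2, 4)
gp_codon <- 5  # hypothetical codon index of the GP site

scores <- pause_scores(toy_counts)
scores[gp_codon]    # pause score at the GP position
which.max(scores)   # codon with the highest occupancy
```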
Output Files
results/prokaryotic/gp_analysis/
├── {database}/
│   ├── all_gp_motifs.tsv.gz        # All extracted GP contexts
│   ├── all_gp_sequences.fasta.gz   # Full sequences
│   ├── domain_annotations.tsv      # Pfam domain hits
│   ├── interdomain_gp.tsv          # Filtered inter-domain
│   └── clusters/
│       ├── cluster_assignments.tsv
│       └── cluster_{N}.fasta
Quality Metrics
Filtering Statistics
| Filter | Sequences Removed |
|---|---|
| No GP motif | N/A (pre-filtered) |
| Too short (<45 aa) | TBD |
| Intra-domain | TBD |
| Low-complexity | TBD |
Cluster Quality
| Metric | Threshold | Purpose |
|---|---|---|
| Cluster size | ≥5 sequences | Statistical power |
| Conservation | ≥0.6 | Functional signal |
| Identity | 70% clustering | Reduce redundancy |
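The size and conservation thresholds in this table could be applied as in the sketch below. How the conservation score is defined is not stated, so `column_conservation` (the mean, over aligned columns, of the most common residue's frequency) is only one plausible definition, and the toy values are invented.

```r
# One possible conservation score for equal-length context sequences:
# mean over columns of the most common residue's frequency (an assumption).
column_conservation <- function(seqs) {
  mat <- do.call(rbind, strsplit(seqs, ""))
  mean(apply(mat, 2, function(col) max(table(col)) / length(col)))
}

column_conservation(c("AAGP", "AAGP", "ARGP"))

# Apply the table's thresholds to per-cluster metrics (toy values).
cluster_quality <- data.frame(
  cluster_id = c("c1", "c2", "c3", "c4"),
  n_sequences = c(12, 3, 8, 5),
  conservation = c(0.72, 0.90, 0.41, 0.60),
  stringsAsFactors = FALSE
)
min_size <- 5            # cluster size threshold from the table
min_conservation <- 0.6  # conservation threshold from the table

passing <- subset(
  cluster_quality,
  n_sequences >= min_size & conservation >= min_conservation
)
passing
```

Both thresholds are inclusive here, matching the "≥" in the table, so a cluster with exactly 5 sequences and a score of exactly 0.6 is kept.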