GP Motif Analysis

Comprehensive analysis of Gly-Pro motifs in prokaryotic proteins

Published: January 18, 2026

Overview

The Gly-Pro (GP) dipeptide is a recurring feature of ribosomal stalling peptides. This analysis examines the distribution, sequence context, and conservation of GP motifs across prokaryotic proteomes.

GP Motif Extraction

Parameters

prokaryotic:
  gp_upstream: 30      # Residues upstream of GP
  gp_downstream: 15    # Residues downstream of GP
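
As a rough illustration of what these two parameters control, the sketch below cuts a window of 30 upstream and 15 downstream residues around each GP in a protein sequence. The helper `extract_gp_contexts` is hypothetical and is not the pipeline's own code; dropping windows that do not fit inside the protein is an assumption.

```r
# Hypothetical helper: extract fixed-size windows around every GP dipeptide.
extract_gp_contexts <- function(seq, upstream = 30, downstream = 15) {
  # 1-based start positions of every "GP" in the sequence (-1 if none)
  starts <- gregexpr("GP", seq, fixed = TRUE)[[1]]
  if (starts[1] == -1) {
    return(data.frame(gp_pos = integer(0), context = character(0),
                      stringsAsFactors = FALSE))
  }
  starts <- as.integer(starts)

  win_start <- starts - upstream
  win_end <- starts + 1 + downstream  # GP occupies starts and starts + 1

  # Assumption: keep only windows that fit entirely inside the protein
  keep <- win_start >= 1 & win_end <= nchar(seq)

  data.frame(
    gp_pos = starts[keep],
    context = substring(seq, win_start[keep], win_end[keep]),
    stringsAsFactors = FALSE
  )
}

# Toy protein: 30 residues, GP, 15 residues, then a GP too close to the C-terminus
toy <- paste0(strrep("A", 30), "GP", strrep("S", 15), "GP")
contexts <- extract_gp_contexts(toy)
print(contexts)
```

With these settings each retained context is 30 + 2 + 15 = 47 residues long, so the second GP in the toy sequence is dropped.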

Expected Distribution

GP motifs occur throughout proteins, but functional stalling sites are expected to have distinctive features:

Feature     Random GP  Stalling GP
Position    Any        Inter-domain
Upstream    Variable   Conserved pattern
Downstream  Variable   Often P or charged

Context Analysis

Upstream Patterns

Functional stalling peptides show conserved patterns upstream of GP:

-30    -20    -10    GP
 |      |      |     ||
 L---F--W------R/K--AGP  (SecM-like)
 W------E------G----GP   (TnaC-like)

Position-Specific Conservation

Position  SecM-like  TnaC-like  Random
-1 (A/G)  High       Medium     Low
-2 (R/K)  High       Low        Low
-3        Variable   Variable   Variable
-10 (W)   High       High       Low
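
One simple way to put numbers on the High/Medium/Low labels is to score each position by the frequency of its most common residue. The sketch below does this for a handful of made-up contexts. `column_conservation` is a hypothetical helper, and this scoring scheme is an assumption rather than the method used for the table.

```r
# Hypothetical helper: per-column conservation of equal-length context sequences,
# scored as the frequency of the most common residue (1 = invariant).
column_conservation <- function(seqs) {
  stopifnot(length(unique(nchar(seqs))) == 1)
  # Matrix with one row per sequence and one column per position
  residues <- do.call(rbind, strsplit(seqs, ""))
  apply(residues, 2, function(col) max(table(col)) / length(col))
}

# Toy set of four 6-residue contexts ending in GP (made-up sequences)
toy_contexts <- c("WIRAGP", "WIRAGP", "WLKAGP", "FIRGGP")
conservation <- column_conservation(toy_contexts)

# Label columns relative to GP: -1 is the residue just before G, 0 is G, 1 is P
names(conservation) <- seq(-4, 1)
print(round(conservation, 2))
```

Other measures, such as Shannon entropy per column, would serve the same purpose; the most-common-residue frequency is used here only because it is easy to read.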

Domain Context

Inter-Domain vs Intra-Domain

GP motifs are classified by their domain context:

flowchart LR
    subgraph Protein
        D1[Domain A] --> L[Linker + GP] --> D2[Domain B]
    end

    subgraph Types
        T1[Inter-domain GP]
        T2[Intra-domain GP]
        T3[Terminal GP]
    end

Figure 1: GP motif domain contexts

  • Inter-domain: Between two annotated Pfam domains
  • Intra-domain: Within a single domain
  • Terminal: Near N- or C-terminus
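
The three categories can be sketched as a small classifier over domain coordinates. Everything here is illustrative: `classify_gp_context` is a hypothetical helper, the 20-residue `terminal_margin` is an assumed cutoff (the source does not define "near"), and the order in which the rules are checked is also an assumption.

```r
# Hypothetical helper: classify one GP position given annotated domain boundaries.
classify_gp_context <- function(gp_pos, protein_length, dom_start, dom_end,
                                terminal_margin = 20) {
  # Intra-domain: the GP start lies inside any annotated domain
  if (any(gp_pos >= dom_start & gp_pos <= dom_end)) {
    return("intra-domain")
  }
  # Inter-domain: at least one domain ends before the GP and one starts after it
  if (any(dom_end < gp_pos) && any(dom_start > gp_pos)) {
    return("inter-domain")
  }
  # Terminal: within the assumed margin of either terminus
  if (gp_pos <= terminal_margin || gp_pos > protein_length - terminal_margin) {
    return("terminal")
  }
  "unclassified"
}

# Toy protein of 300 residues with two domains at 10-100 and 150-250
dom_start <- c(10, 150)
dom_end <- c(100, 250)
gp_positions <- c(50, 120, 290, 270)
labels <- vapply(
  gp_positions, classify_gp_context, character(1),
  protein_length = 300, dom_start = dom_start, dom_end = dom_end
)
print(data.frame(gp_pos = gp_positions, context = labels))
```

The "unclassified" fallback covers GPs that sit outside every domain but are neither between two domains nor close to a terminus, a case the three-way scheme does not address.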

Enrichment Analysis

Functional stalling GPs are enriched in inter-domain regions:

Context       All GP  Stalling GP  Enrichment
Inter-domain  15%     60%          4x
Intra-domain  70%     25%          0.4x
Terminal      15%     15%          1x
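
The enrichment column is the fraction of stalling GPs in a context divided by the fraction of all GPs in that context. A minimal check of the arithmetic, using the percentages from the table:

```r
# Fractions taken from the table above
context_fractions <- data.frame(
  context = c("Inter-domain", "Intra-domain", "Terminal"),
  all_gp = c(0.15, 0.70, 0.15),
  stalling_gp = c(0.60, 0.25, 0.15),
  stringsAsFactors = FALSE
)

# Enrichment = share among stalling GPs / share among all GPs
context_fractions$enrichment <-
  context_fractions$stalling_gp / context_fractions$all_gp

print(transform(context_fractions, enrichment = round(enrichment, 2)))
```

The intra-domain value is 0.25 / 0.70, about 0.36, which the table rounds to 0.4x.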

Clustering Results

GP motif sequences were clustered using MMseqs2 at 70% sequence identity to identify conserved families.
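
For orientation, a clustering step of this kind could be launched from R roughly as follows. This is a sketch under assumptions: the file names are placeholders, `mmseqs_cluster_args` is a hypothetical helper, and only the 70% identity threshold comes from the text. Any other MMseqs2 settings the pipeline uses, such as coverage, are not stated here.

```r
# Hypothetical helper: assemble arguments for `mmseqs easy-cluster`.
mmseqs_cluster_args <- function(fasta, out_prefix, tmp_dir, min_seq_id = 0.7) {
  c("easy-cluster", fasta, out_prefix, tmp_dir,
    "--min-seq-id", format(min_seq_id))
}

cluster_args <- mmseqs_cluster_args(
  fasta = "gp_contexts.fasta",  # placeholder input file
  out_prefix = "gp_clusters",   # placeholder output prefix
  tmp_dir = "tmp"               # scratch directory used by MMseqs2
)

# Show the command line that would be run
cat("mmseqs", cluster_args, "\n")

# To execute it when MMseqs2 is installed and the input exists:
# if (nzchar(Sys.which("mmseqs")) && file.exists("gp_contexts.fasta")) {
#   system2("mmseqs", cluster_args)
# }
```

`easy-cluster` writes a two-column `<prefix>_cluster.tsv` (representative, member), which would still need to be joined back to the context sequences to obtain a table like the one loaded below.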

library(tidyverse)
library(scales)

cluster_file <- here::here("results/prokaryotic/gp_analysis/bacteria/gp_clusters.tsv.gz")

if (file.exists(cluster_file)) {
  clusters_df <- read_tsv(cluster_file, show_col_types = FALSE)
  data_available <- TRUE
} else {
  data_available <- FALSE
  message("Cluster data not found at ", cluster_file)
}

Summary Statistics

if (data_available) {
  n_total_motifs <- nrow(clusters_df)
  n_unique_seqs <- n_distinct(clusters_df$context_sequence)
  n_clusters <- n_distinct(clusters_df$cluster_id)
  duplication_ratio <- n_total_motifs / n_unique_seqs

  cluster_stats <- clusters_df |>
    group_by(cluster_id) |>
    summarise(
      total_motifs = n(),
      unique_seqs = n_distinct(context_sequence),
      .groups = "drop"
    )

  n_singletons <- sum(cluster_stats$unique_seqs == 1)
  pct_singletons <- 100 * n_singletons / n_clusters

  tibble(
    Metric = c(
      "Total GP motifs",
      "Unique sequences",
      "Total clusters",
      "Singleton clusters",
      "Duplication ratio",
      "Largest cluster (unique)",
      "Largest cluster (total)"
    ),
    Value = c(
      comma(n_total_motifs),
      comma(n_unique_seqs),
      comma(n_clusters),
      paste0(comma(n_singletons), " (", round(pct_singletons, 1), "%)"),
      paste0(round(duplication_ratio, 2), "x"),
      comma(max(cluster_stats$unique_seqs)),
      comma(max(cluster_stats$total_motifs))
    )
  ) |>
    knitr::kable()
}

Cluster Size Distribution

if (data_available) {
  plot_data <- cluster_stats |>
    pivot_longer(
      cols = c(unique_seqs, total_motifs),
      names_to = "count_type",
      values_to = "count"
    ) |>
    mutate(count_type = recode(count_type,
      unique_seqs = "Unique Sequences",
      total_motifs = "Total Motifs"
    ))

  ggplot(plot_data, aes(x = count, fill = count_type)) +
    geom_histogram(bins = 50, alpha = 0.7) +
    scale_x_log10(labels = comma) +
    scale_y_log10(labels = comma) +
    facet_wrap(~count_type, ncol = 2) +
    labs(
      x = "Cluster Size",
      y = "Number of Clusters (log scale)"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
}

Top Clusters

if (data_available) {
  cluster_stats |>
    slice_max(unique_seqs, n = 15, with_ties = FALSE) |>  # exactly 15 rows even when sizes tie
    mutate(duplication = round(total_motifs / unique_seqs, 2)) |>
    rename(
      `Cluster ID` = cluster_id,
      `Unique Seqs` = unique_seqs,
      `Total Motifs` = total_motifs,
      Duplication = duplication
    ) |>
    knitr::kable()
}

Duplication Analysis

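Comparing unique sequences with total motifs per cluster shows how much of each cluster consists of repeated identical sequences. Clusters on the dashed 1:1 line contain no duplicates; the further a cluster lies above the line, the more often its sequences recur.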

if (data_available) {
  multi_clusters <- cluster_stats |>
    filter(unique_seqs > 1)

  ggplot(multi_clusters, aes(x = unique_seqs, y = total_motifs)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray50") +
    geom_point(alpha = 0.3, size = 1) +
    scale_x_log10(labels = comma) +
    scale_y_log10(labels = comma) +
    labs(
      x = "Unique Sequences per Cluster",
      y = "Total Motifs per Cluster"
    ) +
    theme_minimal()
}

Cluster Size Breakdown

if (data_available) {
  cluster_stats |>
    mutate(size_category = case_when(
      unique_seqs == 1 ~ "1 (singleton)",
      unique_seqs <= 5 ~ "2-5",
      unique_seqs <= 10 ~ "6-10",
      unique_seqs <= 50 ~ "11-50",
      unique_seqs <= 100 ~ "51-100",
      TRUE ~ ">100"
    )) |>
    mutate(size_category = factor(size_category,
      levels = c("1 (singleton)", "2-5", "6-10", "11-50", "51-100", ">100")
    )) |>
    group_by(size_category) |>
    summarise(
      Clusters = n(),
      `Unique Seqs` = sum(unique_seqs),
      `Total Motifs` = sum(total_motifs),
      .groups = "drop"
    ) |>
    mutate(`% Clusters` = round(100 * Clusters / n_clusters, 1)) |>
    rename(`Size Category` = size_category) |>
    knitr::kable()
}

Downstream Analysis

Codon Usage

Genes encoding stalling peptides may show:

  • Rare codon usage upstream of GP
  • Specific codon at P position
  • mRNA secondary structure

Ribosome Profiling Validation

Compare GP positions to ribosome profiling data:

  • Ribosome occupancy peaks
  • Pause site mapping
  • Translation rate analysis
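
One common way to quantify an occupancy peak is a pause score: footprint density at a codon divided by the mean density across the gene. The sketch below applies that idea to made-up counts. `pause_score` is a hypothetical helper, and this normalisation is an assumption, not a method the analysis has committed to.

```r
# Hypothetical helper: occupancy at selected codons relative to the gene-wide mean.
pause_score <- function(occupancy, codon_pos) {
  gene_mean <- mean(occupancy)
  # Genes without any footprints cannot be scored
  if (gene_mean == 0) {
    return(rep(NA_real_, length(codon_pos)))
  }
  occupancy[codon_pos] / gene_mean
}

# Made-up footprint counts per codon for a 10-codon gene
toy_occupancy <- c(2, 3, 2, 1, 2, 20, 2, 3, 2, 3)
gp_codon <- 6  # assumed codon index of the Pro in a GP motif

score <- pause_score(toy_occupancy, gp_codon)
print(score)
```

In practice the codon assigned to each footprint depends on the read-offset calibration of the profiling data, so GP positions may need to be compared against a small window rather than a single codon.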

Output Files

results/prokaryotic/gp_analysis/
├── {database}/
│   ├── all_gp_motifs.tsv.gz      # All extracted GP contexts
│   ├── all_gp_sequences.fasta.gz # Full sequences
│   ├── domain_annotations.tsv    # Pfam domain hits
│   ├── interdomain_gp.tsv        # Filtered inter-domain
│   └── clusters/
│       ├── cluster_assignments.tsv
│       └── cluster_{N}.fasta

Quality Metrics

Filtering Statistics

Filter              Sequences removed
No GP motif         N/A (pre-filtered)
Too short (<45 aa)  TBD
Intra-domain        TBD
Low-complexity      TBD

Cluster Quality

Metric        Threshold       Purpose
Cluster size  ≥5 sequences    Statistical power
Conservation  ≥0.6            Functional signal
Identity      70% clustering  Reduce redundancy
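
Applied to a per-cluster summary, the size and conservation thresholds amount to a simple filter. The table below is made up for illustration. The column `mean_conservation` is hypothetical, since the source does not say how the conservation score is computed or aggregated.

```r
# Made-up per-cluster summary; mean_conservation is a hypothetical column
cluster_quality <- data.frame(
  cluster_id = c("c1", "c2", "c3", "c4"),
  unique_seqs = c(12, 3, 40, 5),
  mean_conservation = c(0.72, 0.90, 0.41, 0.60),
  stringsAsFactors = FALSE
)

min_size <- 5           # Cluster size threshold from the table
min_conservation <- 0.6 # Conservation threshold from the table

# Keep clusters that meet both thresholds
passing <- subset(
  cluster_quality,
  unique_seqs >= min_size & mean_conservation >= min_conservation
)
print(passing)
```

Here `c2` fails on size and `c3` on conservation, leaving `c1` and `c4`.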