GP Motif Analysis

Comprehensive analysis of Gly-Pro motifs in prokaryotic proteins

Published

January 18, 2026

Overview

The Gly-Pro (GP) dipeptide appears at the stall site of known ribosomal arrest peptides such as SecM. This analysis examines GP motif distribution, sequence context, and conservation across prokaryotic proteomes.

GP Motif Extraction

Parameters

prokaryotic:
  gp_upstream: 30      # Residues upstream of GP
  gp_downstream: 15    # Residues downstream of GP
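
As a rough illustration of how these two parameters define a context window, the sketch below extracts every GP occurrence from a single protein sequence using base R. The function name `extract_gp_contexts`, its return columns, and the choice to clip windows at the sequence ends (rather than pad or discard them) are assumptions made for this example, not the pipeline's actual implementation.

```r
# Hypothetical helper: extract a window around every GP dipeptide in one sequence.
# Defaults mirror gp_upstream and gp_downstream from the parameters above.
extract_gp_contexts <- function(seq, upstream = 30, downstream = 15) {
  hits <- gregexpr("GP", seq, fixed = TRUE)[[1]]
  if (hits[1] == -1) {
    # No GP in this sequence: return an empty table with the same columns
    return(data.frame(gp_pos = integer(0), context = character(0),
                      stringsAsFactors = FALSE))
  }
  gp_pos <- as.integer(hits)  # 1-based position of the G in each GP
  # Assumption: windows are clipped at the sequence ends instead of padded
  start <- pmax(gp_pos - upstream, 1L)
  end <- pmin(gp_pos + 1L + downstream, nchar(seq))
  data.frame(
    gp_pos = gp_pos,
    context = substring(seq, start, end),
    stringsAsFactors = FALSE
  )
}

# Usage: a toy sequence with exactly 30 residues before GP and 15 after
toy <- paste0(strrep("A", 30), "GP", strrep("K", 15))
extract_gp_contexts(toy)
```

With the default parameters a full, unclipped window spans 30 + 2 + 15 = 47 residues.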

Expected Distribution

GP motifs occur throughout proteins, but functional stalling sites have distinctive features:

Feature	Random GP	Stalling GP
Position	Any	Inter-domain
Upstream	Variable	Conserved pattern
Downstream	Variable	Often P or charged

Context Analysis

Upstream Patterns

Functional stalling peptides show conserved patterns upstream of GP:

-30    -20    -10    GP
 |      |      |     ||
 L---F--W------R/K--AGP  (SecM-like)
 W------E------G-----GP  (TnaC-like)

Position-Specific Conservation

Position	SecM-like	TnaC-like	Random
-1 (A/G)	High	Medium	Low
-2 (R/K)	High	Low	Low
-3	Variable	Variable	Variable
-10 (W)	High	High	Low
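
One simple way to quantify position-specific conservation is the frequency of the most common residue in each column of a set of aligned contexts. The sketch below assumes that all contexts have the same length and end with the GP dipeptide, and that the G of GP is numbered position 0 so that upstream residues are negative. The function name `position_conservation`, the numbering, and the toy sequences are illustrative assumptions; the analysis may use a different conservation score.

```r
# Hypothetical helper: per-column frequency of the most common residue.
# Assumes equal-length contexts that all end with the GP dipeptide.
position_conservation <- function(contexts) {
  stopifnot(length(unique(nchar(contexts))) == 1)
  # One row per sequence, one column per aligned position
  residues <- do.call(rbind, strsplit(contexts, ""))
  n_pos <- ncol(residues)
  top_freq <- apply(residues, 2, function(col) max(table(col)) / length(col))
  top_res <- apply(residues, 2, function(col) names(which.max(table(col))))
  data.frame(
    # Assumed numbering: G of GP is 0, P is +1, upstream residues are negative
    position = seq_len(n_pos) - (n_pos - 1L),
    consensus = top_res,
    frequency = top_freq,
    stringsAsFactors = FALSE
  )
}

# Usage with three made-up contexts
position_conservation(c("WRAGP", "WKAGP", "FRAGP"))
```

A frequency near 1 corresponds to "High" in the table above, while values close to the background residue frequency correspond to "Low".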

Domain Context

Inter-Domain vs Intra-Domain

GP motifs are classified by their domain context:

flowchart LR
    subgraph Protein
        D1[Domain A] --> L[Linker + GP] --> D2[Domain B]
    end

    subgraph Types
        T1[Inter-domain GP]
        T2[Intra-domain GP]
        T3[Terminal GP]
    end

Figure 1: GP motif domain contexts

Inter-domain: Between two annotated Pfam domains
Intra-domain: Within a single domain
Terminal: Near N- or C-terminus
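
The three definitions can be sketched as a small classifier over domain coordinates. Everything specific here is an assumption for illustration: the function name `classify_gp`, the `terminal_window` cutoff of 20 residues (the text does not say how near "near" is), the precedence order (intra-domain first, then inter-domain, then terminal), and the extra "unclassified" label for motifs that fit none of the three.

```r
# Hypothetical helper: label one GP position given Pfam domain coordinates.
# dom_start and dom_end are parallel vectors of 1-based domain boundaries.
classify_gp <- function(gp_pos, dom_start, dom_end, protein_len,
                        terminal_window = 20) {
  # Intra-domain: the motif falls inside any annotated domain
  in_domain <- any(gp_pos >= dom_start & gp_pos <= dom_end)
  if (in_domain) return("intra-domain")
  # Inter-domain: at least one domain ends before and one starts after the motif
  flanked <- any(dom_end < gp_pos) && any(dom_start > gp_pos)
  if (flanked) return("inter-domain")
  # Terminal: within the assumed window of either end of the protein
  near_end <- gp_pos <= terminal_window || gp_pos > protein_len - terminal_window
  if (near_end) "terminal" else "unclassified"
}

# Usage: two domains (10-100 and 200-300) in a 320-residue protein
classify_gp(150, c(10, 200), c(100, 300), protein_len = 320)
```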

Enrichment Analysis

Functional stalling GPs are enriched in inter-domain regions:

Context	All GP	Stalling GP	Enrichment
Inter-domain	15%	60%	4x
Intra-domain	70%	25%	0.4x
Terminal	15%	15%	1x
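
The enrichment column is the share of stalling GPs in a context divided by the share of all GPs in that context. A minimal check of that arithmetic, using the percentages from the table above (the variable names are illustrative):

```r
# Percentages copied from the enrichment table
context <- c("Inter-domain", "Intra-domain", "Terminal")
all_gp <- c(15, 70, 15)
stalling_gp <- c(60, 25, 15)

# Enrichment = share among stalling GPs / share among all GPs
enrichment <- setNames(stalling_gp / all_gp, context)
round(enrichment, 1)
```

The intra-domain ratio is 25/70, roughly 0.36, which the table rounds to 0.4x.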

Clustering Results

GP motif sequences were clustered using MMseqs2 at 70% sequence identity to identify conserved families.

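# tidyverse supplies readr, dplyr, tidyr and ggplot2; scales supplies comma()
library(tidyverse)
library(scales)

# Cluster table for the bacterial set, resolved relative to the project root
cluster_file <- here::here("results/prokaryotic/gp_analysis/bacteria/gp_clusters.tsv.gz")

# Load the table if it exists; every later chunk checks data_available first
if (file.exists(cluster_file)) {
  clusters_df <- read_tsv(cluster_file, show_col_types = FALSE)
  data_available <- TRUE
} else {
  data_available <- FALSE
  message("Cluster data not found at ", cluster_file)
}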

Summary Statistics

if (data_available) {
  # Dataset-wide counts
  n_total_motifs <- nrow(clusters_df)
  n_unique_seqs <- n_distinct(clusters_df$context_sequence)
  n_clusters <- n_distinct(clusters_df$cluster_id)
  # Mean number of occurrences per distinct context sequence
  duplication_ratio <- n_total_motifs / n_unique_seqs

  # Per-cluster totals, reused by the sections that follow
  cluster_stats <- clusters_df |>
    group_by(cluster_id) |>
    summarise(
      total_motifs = n(),
      unique_seqs = n_distinct(context_sequence),
      .groups = "drop"
    )

  # A singleton cluster contains exactly one distinct sequence
  n_singletons <- sum(cluster_stats$unique_seqs == 1)
  pct_singletons <- 100 * n_singletons / n_clusters

  tibble(
    Metric = c(
      "Total GP motifs",
      "Unique sequences",
      "Total clusters",
      "Singleton clusters",
      "Duplication ratio",
      "Largest cluster (unique)",
      "Largest cluster (total)"
    ),
    Value = c(
      comma(n_total_motifs),
      comma(n_unique_seqs),
      comma(n_clusters),
      paste0(comma(n_singletons), " (", round(pct_singletons, 1), "%)"),
      paste0(round(duplication_ratio, 2), "x"),
      comma(max(cluster_stats$unique_seqs)),
      comma(max(cluster_stats$total_motifs))
    )
  ) |>
    knitr::kable()
}

Cluster Size Distribution

if (data_available) {
  plot_data <- cluster_stats |>
    pivot_longer(
      cols = c(unique_seqs, total_motifs),
      names_to = "count_type",
      values_to = "count"
    ) |>
    mutate(count_type = recode(count_type,
      unique_seqs = "Unique Sequences",
      total_motifs = "Total Motifs"
    ))

  ggplot(plot_data, aes(x = count, fill = count_type)) +
    geom_histogram(bins = 50, alpha = 0.7) +
    scale_x_log10(labels = comma) +
    scale_y_log10(labels = comma) +
    facet_wrap(~count_type, ncol = 2) +
    labs(
      x = "Cluster Size (log scale)",
      y = "Number of Clusters (log scale)"
    ) +
    theme_minimal() +
    theme(legend.position = "none")
}

Top Clusters

if (data_available) {
  cluster_stats |>
    # with_ties = FALSE caps the table at exactly 15 rows even when sizes tie
    slice_max(unique_seqs, n = 15, with_ties = FALSE) |>
    mutate(duplication = round(total_motifs / unique_seqs, 2)) |>
    rename(
      `Cluster ID` = cluster_id,
      `Unique Seqs` = unique_seqs,
      `Total Motifs` = total_motifs,
      Duplication = duplication
    ) |>
    knitr::kable()
}

Duplication Analysis

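Comparing unique sequences with total motifs shows how duplicated each cluster is. Clusters on the dashed 1:1 line contain each sequence exactly once; the further a cluster lies above the line, the more often its sequences recur. Singleton clusters are excluded from the plot.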

if (data_available) {
  multi_clusters <- cluster_stats |>
    filter(unique_seqs > 1)

  ggplot(multi_clusters, aes(x = unique_seqs, y = total_motifs)) +
    geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "gray50") +
    geom_point(alpha = 0.3, size = 1) +
    scale_x_log10(labels = comma) +
    scale_y_log10(labels = comma) +
    labs(
      x = "Unique Sequences per Cluster",
      y = "Total Motifs per Cluster"
    ) +
    theme_minimal()
}

Cluster Size Breakdown

if (data_available) {
  cluster_stats |>
    mutate(size_category = case_when(
      unique_seqs == 1 ~ "1 (singleton)",
      unique_seqs <= 5 ~ "2-5",
      unique_seqs <= 10 ~ "6-10",
      unique_seqs <= 50 ~ "11-50",
      unique_seqs <= 100 ~ "51-100",
      TRUE ~ ">100"
    )) |>
    mutate(size_category = factor(size_category,
      levels = c("1 (singleton)", "2-5", "6-10", "11-50", "51-100", ">100")
    )) |>
    group_by(size_category) |>
    summarise(
      Clusters = n(),
      `Unique Seqs` = sum(unique_seqs),
      `Total Motifs` = sum(total_motifs),
      .groups = "drop"
    ) |>
    mutate(`% Clusters` = round(100 * Clusters / n_clusters, 1)) |>
    rename(`Size Category` = size_category) |>
    knitr::kable()
}

Downstream Analysis

Codon Usage

Stalling peptides may show:

Rare codon usage upstream of GP
Specific codon at P position
mRNA secondary structure

Ribosome Profiling Validation

Compare GP positions to ribosome profiling data:

Ribosome occupancy peaks
Pause site mapping
Translation rate analysis
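
One common way to relate GP positions to ribosome profiling data is a pause score: footprint density at a codon divided by the mean density across the gene. The sketch below assumes a numeric vector of per-codon footprint counts for a single gene and the codon indices of its GP motifs. The function name `pause_score` and this particular normalization are assumptions for illustration, not a method taken from this analysis.

```r
# Hypothetical helper: pause score = density at a codon / mean density of the gene.
# occupancy: per-codon footprint counts; positions: codon indices of interest.
pause_score <- function(occupancy, positions) {
  gene_mean <- mean(occupancy)
  # Genes with no footprints cannot be normalized
  if (gene_mean == 0) return(rep(NA_real_, length(positions)))
  occupancy[positions] / gene_mean
}

# Usage: a toy 10-codon gene with a peak at codon 10
occ <- c(rep(1, 9), 11)
pause_score(occ, positions = 10)
```

Scores well above 1 at a GP position would be consistent with a pause site, although any cutoff has to be chosen against the background distribution of the dataset.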

Output Files

results/prokaryotic/gp_analysis/
├── {database}/
│   ├── all_gp_motifs.tsv.gz      # All extracted GP contexts
│   ├── all_gp_sequences.fasta.gz # Full sequences
│   ├── domain_annotations.tsv    # Pfam domain hits
│   ├── interdomain_gp.tsv        # Filtered inter-domain
│   └── clusters/
│       ├── cluster_assignments.tsv
│       └── cluster_{N}.fasta

Quality Metrics

Filtering Statistics

Filter	Sequences Removed
No GP motif	N/A (pre-filtered)
Too short (<45 aa)	TBD
Intra-domain	TBD
Low-complexity	TBD

Cluster Quality

Metric	Threshold	Purpose
Cluster size	≥5 sequences	Statistical power
Conservation	≥0.6	Functional signal
Identity	70% clustering	Reduce redundancy
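
The size and conservation thresholds can be applied as a simple row filter over a per-cluster summary table. The sketch below assumes a data frame with a `unique_seqs` column (as in `cluster_stats` above) and a hypothetical `conservation` column holding a per-cluster score between 0 and 1. How that score is computed, and whether cluster size should count unique or total sequences, is not specified here, so both are assumptions.

```r
# Hypothetical helper: keep clusters that meet both quality thresholds.
# `conservation` is an assumed column name for a per-cluster score in [0, 1].
filter_quality_clusters <- function(stats, min_size = 5, min_conservation = 0.6) {
  keep <- stats$unique_seqs >= min_size & stats$conservation >= min_conservation
  stats[keep, , drop = FALSE]
}

# Usage with made-up values
toy_stats <- data.frame(
  cluster_id = c("c1", "c2", "c3"),
  unique_seqs = c(10, 3, 8),
  conservation = c(0.8, 0.9, 0.4),
  stringsAsFactors = FALSE
)
filter_quality_clusters(toy_stats)
```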