This guide provides conventions and commands for AI coding agents working in the scraps repository.
Project: scraps - Single Cell RNA PolyA Site Discovery
Type: Snakemake bioinformatics pipeline for analyzing mRNA polyadenylation sites from single-cell RNA-seq data
Primary Languages: Python 3, R, Snakemake, Shell (zsh)
# Dry-run to validate pipeline (recommended before any changes)
snakemake -npr --configfile config.yaml
# Run pipeline with test data
snakemake --snakefile Snakefile \
    --configfile config.yaml \
    --resources total_impact=5 \
    --keep-going
# Run with specific number of cores
snakemake -j 8 --configfile config.yaml
# Generate DAG visualization
snakemake --dag | dot -Tpdf > dag.pdf
# Always dry-run first to validate Snakemake syntax
snakemake -npr --configfile config.yaml
# Test specific rule
snakemake -npr --configfile config.yaml results/counts/chromiumv2_test_R2_counts.tsv.gz
# List all rules
snakemake --list
# Show reason for rule execution
snakemake -npr --reason --configfile config.yaml
# Create conda environment
conda env create -f scraps_conda.yml
# Activate environment
conda activate scraps_conda
# Update environment after changes
conda env update -n scraps_conda -f scraps_conda.yml
scraps/
├── Snakefile # Main workflow entry point
├── config.yaml # Sample and pipeline configuration
├── chemistry.yaml # Platform-specific chemistry configs
├── scraps_conda.yml # Conda environment specification
├── rules/ # Snakemake rule modules
│ ├── cutadapt_star.snake # Read trimming and alignment
│ ├── count.snake # Feature counting and quantification
│ ├── qc.snake # Quality control reports
│ └── check_versions.snake # Dependency version checks
├── inst/scripts/ # Helper scripts
│ ├── *.py # Python utilities (BAM filtering, etc.)
│ └── R/ # R analysis functions
├── ref/ # Reference files (polyA_DB, etc.)
├── sample_data/ # Test data location
└── results/ # Pipeline outputs (generated)
Imports: Standard library → Third party → Local, grouped and sorted
import os
import re
import argparse

import pysam
import pandas as pd
import numpy as np
Docstrings: Triple-quoted strings describing script/function purpose
""" Filter BAM files to only reads with soft-clipped A tail,
suitable for cellranger and starsolo output
"""
Command-line arguments: Use argparse with descriptive help text
parser.add_argument('-i', '--inbam',
                    help="Bam file to correct",
                    required=True)
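Following that convention, a minimal self-contained sketch of an argparse-based script interface is shown below. The `--outbam` and `--target_len` options and their defaults are illustrative, not the repository's actual flags:

```python
import argparse

def build_parser():
    """Build the argument parser for a hypothetical BAM-filtering script."""
    parser = argparse.ArgumentParser(
        description="Filter a BAM file to reads with a soft-clipped A tail")
    parser.add_argument('-i', '--inbam',
                        help="Bam file to correct",
                        required=True)
    parser.add_argument('-o', '--outbam',
                        help="Path for the filtered output BAM (illustrative option)",
                        required=True)
    parser.add_argument('-t', '--target_len',
                        help="Minimum A-tail length to keep (illustrative default)",
                        type=int,
                        default=10)
    return parser

# Parse an explicit argument list rather than sys.argv, for demonstration
args = build_parser().parse_args(['-i', 'in.bam', '-o', 'out.bam'])
print(args.inbam, args.target_len)  # in.bam 10
```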
Naming conventions:
- Functions: snake_case (e.g., filter_bam_by_A, correct_bam_read1)
- Variables: snake_case (e.g., target_len, filter_cut, single_end)
- Constants: UPPER_CASE if truly constant

File handling: Use context managers for file operations
with open(file_in) as file, gzip.open(file_out, 'wt') as file2:
    # process files
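A complete, runnable version of that pattern is sketched below: a plain-text file is filtered into a gzip-compressed output inside context managers. The function name, the length filter, and the temporary files are illustrative:

```python
import gzip
import os
import tempfile

def copy_long_lines(file_in, file_out, min_len=3):
    """Copy lines of at least min_len characters from a plain-text input
    into a gzip-compressed output, using context managers throughout."""
    with open(file_in) as src, gzip.open(file_out, 'wt') as dst:
        for line in src:
            if len(line.rstrip('\n')) >= min_len:
                dst.write(line)

# Illustrative round trip using temporary files
tmpdir = tempfile.mkdtemp()
plain = os.path.join(tmpdir, 'in.txt')
packed = os.path.join(tmpdir, 'out.txt.gz')
with open(plain, 'w') as fh:
    fh.write('ab\nabcd\n')
copy_long_lines(plain, packed)
with gzip.open(packed, 'rt') as fh:
    print(fh.read())  # only the line with >= 3 characters survives
```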
Documentation: Roxygen2-style comments for functions
#' Read scraps output from umi_tools to sparseMatrix
#'
#' @param file scraps output table
#' @param n_min minimum number of observations
#' @return count matrix
#' @export
Style: Follow tidyverse conventions
- Use the %>% pipe operator
- Prefer dplyr, readr, stringr, and tidyr functions
- Use snake_case naming

Dependencies: Import packages explicitly
#' @import readr dplyr stringr tidyr
Shell executable: Pipeline uses zsh (defined in Snakefile line 1)
shell.executable("zsh")
Rule structure: Include all standard sections
rule rulename:
    input:
        "path/to/input.bam"
    output:
        temp("path/to/output.bam")  # Use temp() for intermediate files
    params:
        job_name = "rulename",
        # Additional parameters
    log:
        "{results}/logs/{sample}_rulename.txt"
    threads:
        12
    resources:
        mem_mb = 8000
    shell:
        r"""
        command --arg {input} > {output} 2> {log}
        """
Key conventions:
r"""...""" for shell blocks2> {log}temp(){sample}, {results}, {read}threads, mem_mbexpand() for generating multiple outputsAccessing config: Use helper functions like _get_config(sample, item)
def _get_config(sample, item):
    # Hierarchical lookup: sample -> chemistry[platform] -> chemistry -> defaults
- DATA: Directory containing input FASTQs
- RESULTS: Output directory path
- STAR_INDEX: Path to STAR genome index
- POLYA_SITES: PolyA database reference file (SAF format)
- DEFAULTS: Default chemistry and platform settings
- SAMPLES: Per-sample configuration (basename, chemistry, alignments)

Platform-specific configurations in chemistry.yaml are organized hierarchically:
chemistry_name:
  bc_whitelist: path/to/whitelist
  platform_name:
    cutadapt_R1: "trimming parameters"
    STAR_R1: "alignment parameters"
    STAR_R2: "alignment parameters"
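For reference, the top-level config.yaml keys listed above might be laid out as in the sketch below. All paths and sample names are illustrative placeholders, not values from the repository:

```yaml
DATA: "sample_data"
RESULTS: "results"
STAR_INDEX: "/path/to/star_index"
POLYA_SITES: "ref/polyA_DB.saf"
DEFAULTS:
  chemistry: "chromiumv2"
  platform: "illumina"
SAMPLES:
  chromiumv2_test:
    basename: "chromiumv2_test"
    chemistry: "chromiumv2"
```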
- New rules: add to rules/, named verb_target (e.g., assign_sites_R1); register outputs in SAMPLE_OUTS (Snakefile) and validate with snakemake -npr
- New chemistries: add to chemistry.yaml, defining cutadapt_*, STAR_* parameters and bc_whitelist, bc_cut, bc_length settings
- Helper scripts: place in inst/scripts/, make executable (chmod +x script.py), and run as python3 inst/scripts/script.py

Log files: All rules write logs to {results}/logs/
Common issues:
- Missing tools or version mismatches: rebuild the environment from scraps_conda.yml
- Rule or syntax errors: validate with snakemake -npr
- Missing input files: check the DATA path in config.yaml
- Jobs failing on resources: increase mem_mb or threads in rules

Debugging Snakemake:
# Show detailed execution plan
snakemake -npr --verbose
# Print shell commands without execution
snakemake -np --printshellcmds
# Force re-run specific rule
snakemake --forcerun rulename
Core requirements (installed via conda):
Version checking: Run snakemake --configfile config.yaml to trigger version checks
Remember:
- Run snakemake -npr before any pipeline changes
- The pipeline shell is zsh, not bash
- Mark intermediate files with the temp() directive