Common Pain Points¶
This page covers recurring issues that Bodhi users encounter when migrating from LSF to SLURM. These aren't simple directive swaps — they're behavioral differences that catch people off guard.
Debugging OOM (Out-of-Memory) errors¶
How OOM kills look in SLURM¶
When a job exceeds its memory allocation, SLURM kills it immediately. The job state is set to OUT_OF_MEMORY:
$ sacct -j 12345 --format=JobID,JobName,State,ExitCode,MaxRSS
JobID JobName State ExitCode MaxRSS
------------ ---------- ---------- -------- ----------
12345 analysis OUT_OF_ME+ 0:125
12345.batch batch OUT_OF_ME+ 0:125 15.8G
You can also see this with seff:
$ seff 12345
Job ID: 12345
State: OUT_OF_MEMORY (exit code 0)
Memory Utilized: 15.80 GB
Memory Efficiency: 98.75% of 16.00 GB
This is different from LSF
On Bodhi's LSF, memory limits were often soft limits — jobs could exceed their requested memory without being killed (as long as the node had memory available). In SLURM, --mem is a hard limit enforced by cgroups. If your job exceeds it, even briefly, it will be killed.
Diagnosing memory usage¶
For completed jobs, use sacct:
# Check peak memory usage
sacct -j <jobid> --format=JobID,JobName,MaxRSS,MaxVMSize,State
# For array jobs, check all tasks
sacct -j <jobid> --format=JobID%20,JobName,MaxRSS,State
For running jobs, use sstat:
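A minimal sketch of such a check, assuming the batch step is the one of interest. The helper name running_mem and the job ID 12345 are placeholders, not Bodhi-provided commands:

```shell
# Hypothetical helper: report memory use of a running job's batch step.
running_mem() {
    local jobid="$1"
    if ! command -v sstat >/dev/null 2>&1; then
        echo "sstat not found; run this on a cluster node" >&2
        return 127
    fi
    # MaxRSS is the peak resident memory so far; AveRSS is the average across tasks.
    sstat -j "${jobid}.batch" --format=JobID,MaxRSS,AveRSS,MaxVMSize
}

# Usage (placeholder job ID):
#   running_mem 12345
```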
Use seff for quick checks
seff <jobid> prints a short summary of a completed job, including its memory efficiency. It's the fastest way to check whether your job was close to its memory limit.
Fixing OOM errors¶
- Check what your job actually used: run seff <jobid> on a similar completed job to see actual peak memory.
- Request more memory with headroom: add a 20–30% buffer above the observed peak.
- Use --mem-per-cpu for multi-threaded jobs if your job's memory scales with cores.
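The headroom rule in the steps above can be sketched as a small calculation. This assumes you already have a peak value in MB (for example from seff or sacct); mem_with_headroom is a hypothetical helper name, and 25% is just one value in the suggested 20–30% range:

```shell
# Hypothetical helper: peak memory (MB) plus a percentage buffer, rounded up.
mem_with_headroom() {
    local peak_mb="$1" pct="${2:-25}"
    # Integer ceiling of peak_mb * (100 + pct) / 100
    echo $(( (peak_mb * (100 + pct) + 99) / 100 ))
}

# A job that peaked at 15800 MB would request 19750 MB:
echo "#SBATCH --mem=$(mem_with_headroom 15800 25)M"
```

The printed line can be pasted into a job script in place of the old --mem directive.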
Default memory when --mem is not specified
Bodhi's default is DefMemPerCPU=4000 (4 GB per CPU). So a job requesting --cpus-per-task=4 with no --mem gets 16 GB total. A single-CPU job gets 4 GB.
Don't just request the maximum
Requesting far more memory than you need reduces scheduling priority and wastes cluster resources. Right-size your requests based on actual usage from seff.
Tasks vs CPUs: --ntasks vs --cpus-per-task¶
LSF's -n N meant one thing: "give me N cores." SLURM splits that same idea into two separate flags, and picking the wrong one is one of the most common migration mistakes on Bodhi.
The framing difference¶
SLURM distinguishes between processes and threads within a process:
- --ntasks=N: how many independent processes your job will launch (think MPI ranks, or N separate commands)
- --cpus-per-task=N: how many threads/cores each of those processes will use
The total cores allocated to your job is ntasks × cpus-per-task. Both flags default to 1, so a bare sbatch with no CPU flags gets you exactly one core.
This is different from LSF
LSF -n 8 almost always maps to --cpus-per-task=8, not --ntasks=8. Almost all bioinformatics tools (samtools, bwa, STAR, R mclapply, anything using OpenMP) are threaded, not MPI. If you set --ntasks=8 for a threaded tool, SLURM reserves 8 slots but your tool only uses one of them — you get slower scheduling and worse performance.
Which flag do I use?¶
| Your job | Use | Example |
|---|---|---|
| Single-threaded serial | neither (defaults) | python script.py |
| Multi-threaded (-@, -t, --threads, OpenMP) | --cpus-per-task=N | samtools sort -@ 8 |
| MPI (mpirun, mpiexec) | --ntasks=N | mpirun ./mpi_app |
| Embarrassingly parallel via srun or GNU parallel | --ntasks=N | N independent shell commands |
| Hybrid MPI + threads | both | --ntasks=4 --cpus-per-task=8 |
Behavioral consequences¶
What srun does with each flag. srun cmd with --ntasks=N launches N copies of cmd in parallel. srun cmd with --cpus-per-task=N --ntasks=1 launches one copy of cmd and hands it N cores — cmd itself is responsible for spawning threads to use them.
What environment variable your tool sees. SLURM exports $SLURM_NTASKS and $SLURM_CPUS_PER_TASK. A common idiom for threaded tools:
#SBATCH --cpus-per-task=8
#SBATCH --mem=16G
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
samtools sort -@ $SLURM_CPUS_PER_TASK input.bam -o sorted.bam
Node placement. --ntasks tasks may be spread across multiple nodes unless you also set --nodes=1. --cpus-per-task cores are always on the same node — a single process can't span nodes.
Memory interaction¶
--mem is per-job (total). --mem-per-cpu multiplies by the total cores (ntasks × cpus-per-task):
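As a worked example of that multiplication, the sketch below computes the job total implied by --mem-per-cpu. The helper name total_mem_mb and the sample values are illustrative assumptions, not Bodhi defaults:

```shell
# Hypothetical helper: total memory (MB) implied by --mem-per-cpu.
total_mem_mb() {
    local ntasks="$1" cpus_per_task="$2" mem_per_cpu_mb="$3"
    echo $(( ntasks * cpus_per_task * mem_per_cpu_mb ))
}

# --ntasks=1 --cpus-per-task=8 --mem-per-cpu=2000 gives 16000 MB for the job
total_mem_mb 1 8 2000

# --ntasks=4 --cpus-per-task=8 --mem-per-cpu=2000 gives 64000 MB across all tasks
total_mem_mb 4 8 2000
```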
Quick self-check
- If your tool has a --threads, -t, -@, or -p flag, you want --cpus-per-task=N.
- If you're launching mpirun or mpiexec, you want --ntasks=N.
- If in doubt, --cpus-per-task is the right default for typical bioinformatics work on Bodhi.
Understanding SLURM accounts¶
What is --account?¶
In SLURM, the --account flag associates your job with a resource allocation account. This is used for:
- Fair-share scheduling — accounts that have used fewer resources recently get higher priority
- Resource tracking — PIs and admins can see how allocations are consumed
- Access control — some partitions may be restricted to certain accounts
Why this matters on Bodhi
On LSF, the -P project flag was often optional or had a simple default. On SLURM, submitting with the wrong account (or no account) can result in job rejection or lower scheduling priority.
Finding your account(s)¶
# List your SLURM associations (accounts and partitions you can use)
sacctmgr show associations user=$USER format=Account,Partition,QOS
# Shorter version — just account names
sacctmgr show associations user=$USER format=Account --noheader | sort -u
Bodhi accounts are lab/group-based. Each account corresponds to a research group or resource class:
| Account | Description |
|---|---|
| bmg | Biochemistry and Molecular Genetics |
| rbi | RNA Bioscience Initiative |
| jones | Jones lab (Pediatrics) |
| genome | Genome group |
| scb | SCB group (SOM Hematology) |
| gpu_rbi | GPU access for RBI |
| gpu_scb | GPU access for SCB |
| bigmem | Large-memory node access |
| cranio | Craniofacial group |
| normal | General/shared access |
| peds_devbio | Pediatrics Developmental Biology |
| peds_hematology | Pediatrics Hematology |
| som_hematology | SOM Hematology |
| som_dermatology | SOM Dermatology |
| medical_oncology | Medical Oncology |
| gastroenterology | Gastroenterology |
Most users are associated with their PI's lab account. You may belong to multiple accounts (e.g., rbi for CPU jobs and gpu_rbi for GPU jobs).
Setting a default account¶
Rather than adding --account to every script, set a default:
# Set your default account (persists across sessions)
sacctmgr modify user $USER set DefaultAccount=<your_account>
You can also add it to your ~/.bashrc or a SLURM defaults file:
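One way to do this, sketched under the assumption that your account is rbi (a placeholder; substitute one of your own associations), is to export the environment variables that sbatch, salloc, and srun read for a default account:

```shell
# In ~/.bashrc. "rbi" is a placeholder account name.
export SBATCH_ACCOUNT="rbi"              # default for sbatch
export SALLOC_ACCOUNT="$SBATCH_ACCOUNT"  # default for salloc
export SLURM_ACCOUNT="$SBATCH_ACCOUNT"   # default for srun
```

Per the sbatch documentation, environment variables override #SBATCH lines inside a script, while command-line options override both. With SBATCH_ACCOUNT exported, a different account therefore has to be passed as sbatch --account=... on the command line rather than in the script header.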
In your job scripts¶
--account is effectively required on Bodhi
Bodhi enforces AccountingStorageEnforce=associations,limits,qos, which means jobs are rejected if your user lacks a valid account association for the target partition and QoS. If you have only one account, SLURM uses it automatically. If you have multiple accounts, set a default (see above) to avoid specifying --account on every submission.
Paying attention to wall time¶
SLURM enforces --time strictly¶
In SLURM, the --time (wall time) limit is a hard cutoff. When your job hits the limit:
- SLURM sends SIGTERM to your job (giving it a chance to clean up)
- After a 30-second grace period (KillWait=30), SLURM sends SIGKILL
- The job state is set to TIMEOUT
$ sacct -j 12345 --format=JobID,JobName,Elapsed,Timelimit,State
JobID JobName Elapsed Timelimit State
------------ ---------- ---------- ---------- ----------
12345 longrun 02:00:00 02:00:00 TIMEOUT
This is different from LSF
On Bodhi's LSF, wall-time limits were often loosely enforced — jobs could sometimes run past their -W limit. In SLURM, when your time is up, your job is killed. Period.
Checking remaining time¶
From outside the job:
# See time limit and elapsed time
squeue -u $USER -o "%.10i %.20j %.10M %.10l %.6D %R"
# Columns: job ID, name, elapsed time (%M), time limit (%l), node count, reason/node list
# Detailed view
scontrol show job <jobid> | grep -E "RunTime|TimeLimit"
From inside the job (in your script):
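A minimal sketch, assuming squeue is available on the compute node. SLURM_JOB_ID is set by SLURM inside a job and %L is squeue's time-left field; the helper names are hypothetical:

```shell
# Convert squeue's time-left string (MM:SS, HH:MM:SS, or D-HH:MM:SS) to seconds.
time_left_to_seconds() {
    printf '%s\n' "$1" | awk -F'[-:]' '
        NF == 2 { print $1 * 60 + $2 }
        NF == 3 { print $1 * 3600 + $2 * 60 + $3 }
        NF == 4 { print $1 * 86400 + $2 * 3600 + $3 * 60 + $4 }
    '
}

# Seconds remaining for the current job; fails outside a SLURM job.
job_seconds_left() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "not running inside a SLURM job" >&2
        return 1
    fi
    time_left_to_seconds "$(squeue -h -j "$SLURM_JOB_ID" -o %L)"
}

# Example: stop starting new work when under 10 minutes remain.
#   [ "$(job_seconds_left)" -lt 600 ] && echo "wrapping up"
```

Values such as UNLIMITED or NOT_SET produce no output from the converter, so a caller should treat an empty result as "unknown".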
Consequences of TIMEOUT¶
- Your job output may be incomplete or corrupted
- Any files being written at kill time may be truncated
- Temporary files won't be cleaned up
Add cleanup traps
If your job writes large intermediate files, add a trap to handle SIGTERM:
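A minimal sketch of such a trap, assuming intermediate files live in one scratch directory. The directory and the sleep stand-in are placeholders for your own paths and commands:

```shell
# Hypothetical scratch directory for intermediate files.
SCRATCH_DIR="$(mktemp -d)"

cleanup() {
    rm -rf "$SCRATCH_DIR"
}

# Run cleanup when SIGTERM arrives at the time limit, then exit non-zero.
trap 'cleanup; exit 143' TERM

# The shell defers traps until a foreground command returns, so run the
# long step in the background and wait on it instead.
sleep 1 &          # stand-in for the real long-running command
wait $!

# Normal completion: clean up as well.
cleanup
```

Keep the handler short: with KillWait=30, anything still running 30 seconds after SIGTERM is killed. If cleanup needs longer, sbatch's --signal option can request a warning signal earlier than the limit.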
Bodhi partition time limits¶
| Partition | Max wall time | Default wall time | Nodes | Access | Notes |
|---|---|---|---|---|---|
| normal | 3 days | 4 hours | compute01–04, 06–07, 14 | All accounts | Default partition |
| interactive | 1 day | 8 hours | compute03–04, 06–07 | All accounts | Max 3 jobs/user |
| rna | 3 days | 4 hours | compute07–09, 15–20 | rbi | Falls back to normal |
| jones | 3 days | 4 hours | compute04–05, 10–12 | jones | |
| genome | 3 days | 4 hours | compute06–09 | genome | Falls back to normal |
| gpu | 3 days | 12 hours | compgpu01, 03 | gpu_rbi | 8× NVIDIA A30 |
| scb_gpu | 3 days | 12 hours | compgpu02 | gpu_scb | 4× NVIDIA A30 |
| scb | 3 days | 4 hours | compute13 | scb | |
| cranio | 3 days | 4 hours | compute21 | scb | Falls back to normal |
| bigmem | 3 days | 4 hours | compute14 | bigmem | ~1.5 TB RAM |
| rstudio | 3 days | 8 hours | compute00 | bigmem | Interactive RStudio |
| voila | 3 days | 4 hours | compute00 | bigmem | Voilà notebooks |
Default wall time changed — jobs may time out
If you omit --time, your job now gets 4 hours (general partitions) or 12 hours (GPU partitions). Previously, jobs without --time silently inherited the 3-day maximum.
If your jobs are timing out, add --time with a realistic estimate:
For jobs that need more than 3 days, use the long QoS (up to 7 days):
Why the change? Shorter default times dramatically improve scheduling. SLURM's backfill scheduler can only fit jobs into gaps if it knows when running jobs will end. A job with no --time previously looked like a 3-day job to the scheduler — even if it finished in 20 minutes — blocking other jobs from backfilling into the gap.
Right-size your --time requests
Request about 20–30% more than your expected runtime. Use seff <jobid> to check how long past jobs actually took. Shorter time requests schedule faster via backfill.
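As a worked example of that rule, the sketch below adds a buffer to an observed runtime and prints it in SLURM's D-HH:MM:SS format. The helper name time_with_headroom is hypothetical, and 25% is one value in the suggested range:

```shell
# Hypothetical helper: add a percentage buffer to a runtime in seconds and
# print the result in SLURM's D-HH:MM:SS format.
time_with_headroom() {
    local runtime_s="$1" pct="${2:-25}"
    # Integer ceiling of runtime_s * (100 + pct) / 100
    local total=$(( (runtime_s * (100 + pct) + 99) / 100 ))
    printf '%d-%02d:%02d:%02d\n' \
        $(( total / 86400 )) \
        $(( total % 86400 / 3600 )) \
        $(( total % 3600 / 60 )) \
        $(( total % 60 ))
}

# A job that took 2 hours (7200 s) would request 2.5 hours:
echo "#SBATCH --time=$(time_with_headroom 7200 25)"
```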
Check current limits
Partition limits can change. Verify the current limits with:
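A minimal sketch of such a check using sinfo's format specifiers, where %P is the partition name, %l the maximum time limit, and %L the default time limit. The helper name partition_limits is hypothetical:

```shell
# Hypothetical helper: list each partition with its max and default wall time.
partition_limits() {
    if ! command -v sinfo >/dev/null 2>&1; then
        echo "sinfo not found; run this on a cluster node" >&2
        return 127
    fi
    sinfo -o "%12P %12l %12L"
}

# For the full configuration of one partition (placeholder name "normal"):
#   scontrol show partition normal
```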
Tips for setting wall time¶
- Start with a generous estimate, then refine based on actual runtimes using seff or sacct.
- Shorter jobs schedule faster: SLURM's backfill scheduler can fit shorter jobs into gaps. Requesting 2 hours instead of 7 days can dramatically reduce queue wait time.
- Use sacct to check past runtimes.
- SLURM format for --time:

| Format | Meaning |
|---|---|
| MM | Minutes |
| HH:MM:SS | Hours, minutes, seconds |
| D-HH:MM:SS | Days, hours, minutes, seconds |
| D-HH | Days and hours |
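To check past runtimes with sacct, a minimal sketch follows. It assumes a seven-day lookback window, which is an arbitrary choice, and the helper name past_runtimes is hypothetical:

```shell
# Hypothetical helper: compare elapsed time with the requested limit
# for your recent jobs.
past_runtimes() {
    local since="${1:-now-7days}"
    if ! command -v sacct >/dev/null 2>&1; then
        echo "sacct not found; run this on a cluster node" >&2
        return 127
    fi
    # -X shows one line per job allocation instead of one per step.
    sacct -u "${USER:-$(id -un)}" -S "$since" -X \
        --format=JobID,JobName%20,Elapsed,Timelimit,State
}

# Usage:
#   past_runtimes            # last 7 days
#   past_runtimes now-2weeks # custom window
```

Jobs whose Elapsed is far below Timelimit are candidates for a shorter --time request.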
QoS tiers¶
SLURM Quality of Service (QoS) tiers control job priority, time limits, and resource caps. Every job runs under a QoS — if you don't specify one, you get normal.
Available QoS tiers¶
| QoS | Priority | Max wall time | Max running jobs | Max queued jobs | Resource caps | Use case |
|---|---|---|---|---|---|---|
| high | 100 | partition limit | 10 | 20 | none | Urgent jobs; requires admin approval |
| long | 50 | 7 days | 5 | 50 | 32 CPUs | Extended runs that need >1 day |
| interactive | 50 | 12 hours | 3 | 3 | 16 CPUs, 8 GB | Terminal sessions via sinteractive |
| normal | 25 | 1 day | 100 | 500 | none | Default for most jobs |
| low | 10 | 7 days | 50 | 200 | none | Background, non-urgent work |
How priority works
Higher priority QoS tiers are scheduled first. The high QoS can also preempt (requeue) running normal jobs when resources are needed.
Using a QoS¶
Or on the command line:
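A minimal sketch covering both forms, assuming the low QoS and a placeholder script name (qos_example.sh). The first part writes a job script with a --qos directive; the second submits it with the flag given on the command line:

```shell
# Write a small job script that requests the low QoS (file name is a placeholder).
cat > qos_example.sh <<'EOF'
#!/bin/bash
#SBATCH --qos=low
#SBATCH --time=04:00:00
echo "running under QoS: ${SLURM_JOB_QOS:-unknown}"
EOF

# Submit it. A --qos given on the command line takes precedence over the
# #SBATCH line in the script.
if command -v sbatch >/dev/null 2>&1; then
    sbatch --qos=low qos_example.sh
fi
```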
When to use long¶
The normal QoS has a 1-day time limit. If your job needs more time, use long:
long has OverPartQOS — it overrides partition time limits, so you can run >1 day jobs on any partition. The tradeoff: you're limited to 5 running jobs and 32 CPUs total under long.
When to use low¶
For jobs that can wait — overnight runs, large batch submissions where turnaround isn't critical:
low allows up to 7 days and 50 concurrent jobs, but at reduced priority.
DenyOnLimit¶
All QoS tiers enforce DenyOnLimit — if your job exceeds the QoS limits (too many running jobs, too many CPUs, etc.), it is rejected at submit time rather than sitting in the queue forever. You'll see an immediate error telling you what limit was hit.