GPU Jobs

The gpu partition provides access to Bodhi's GPU nodes. This page covers the hardware, how to submit jobs, and the per-partition limits you need to work within.

Hardware

Node	CPUs	GPUs	GPU model
`compgpu01`	64	4	NVIDIA A30
`compgpu03`	64	4	NVIDIA A30

Total: 2 nodes, 128 CPUs, 8 A30 GPUs.

Check the live state anytime:

sinfo -p gpu -o "%N %c %G %t"
scontrol show node compgpu01 | grep -E "CPUTot|Gres|RealMemory"

Partition settings

Setting	Value	Notes
Default runtime	`12:00:00`	If you don't specify `--time`, you get 12 hours
Max runtime	Set by QOS	Use `--qos=long` for extended runs
Allowed QOS	`normal`, `long`	Default QOS is `normal`
Default memory	`12 GB / node`	Override with `--mem`
Default CPUs per GPU	`16`	If you don't set `--cpus-per-task`, you get 16 CPUs for each GPU you request
Max CPUs per node per job	`16`	A single job cannot use more than 16 CPUs on one gpu node; to get more CPUs, spread the job across both nodes
Default partition?	No	You must pass `-p gpu` explicitly
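
The values in this table can drift as the cluster is reconfigured. A minimal sketch for reading them from the live configuration, assuming the Slurm client tools are on your PATH; `partition_field` is a hypothetical helper name, not a Slurm command:

```shell
# Print the value of one Key=Value field from `scontrol show partition` output.
# Reads the scontrol text on stdin; $1 is the field name (e.g. DefaultTime).
partition_field() {
    tr ' ' '\n' | awk -F= -v key="$1" '$1 == key { print substr($0, length(key) + 2); exit }'
}

# On a login node:
#   scontrol show partition gpu | partition_field DefaultTime
#   scontrol show partition gpu | partition_field MaxCPUsPerNode
#   scontrol show partition gpu | partition_field AllowQos
```

If a field comes back empty, the partition most likely does not set it, or the name differs in your Slurm version.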

Account required

The gpu partition is restricted by account. You must submit with -A <account>, and your account must be on the partition's allow-list. Jobs submitted to -p gpu without a permitted account are rejected. See Requesting access below.

How to submit

Interactive GPU session (`srun`)

srun -p gpu -A <your_account> \
     --gres=gpu:1 \
     --cpus-per-task=4 \
     --mem=32G \
     --time=04:00:00 \
     --pty bash

Inside the shell, confirm the GPU is visible:

nvidia-smi
echo "CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"
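
To check the allocation from a script rather than by eye, the device list can be counted. This is a sketch that assumes CUDA_VISIBLE_DEVICES holds a comma-separated list, as the echo above prints; `count_visible_gpus` is a hypothetical helper:

```shell
# Count the GPUs listed in a comma-separated device string.
# With no argument, falls back to $CUDA_VISIBLE_DEVICES from the job environment.
count_visible_gpus() {
    local devices="${1-${CUDA_VISIBLE_DEVICES-}}"
    local -a ids=()
    IFS=',' read -r -a ids <<< "$devices"
    echo "${#ids[@]}"
}

# Inside a session started with --gres=gpu:1 this should print 1:
#   count_visible_gpus
```

A result of 0 usually means the variable is empty or unset, which matches the missing `--gres` case in Troubleshooting.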

Batch job (`sbatch`)

#!/bin/bash
#SBATCH --job-name=gpu_train
#SBATCH --partition=gpu
#SBATCH --account=<your_account>
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=4
#SBATCH --mem=32G
#SBATCH --time=12:00:00
#SBATCH --output=logs/gpu_train.%j.out
#SBATCH --error=logs/gpu_train.%j.err

module load cuda/12.2
source ~/venvs/torch/bin/activate

nvidia-smi
python train.py --data /data/input --out /data/output

Submit:

mkdir -p logs
sbatch gpu_job.sh
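
Some numerical libraries choose their thread count from the node's full core count rather than from the job's allocation. A small sketch that could go in the batch script before the python line, assuming your code honours OMP_NUM_THREADS; Slurm sets SLURM_CPUS_PER_TASK when `--cpus-per-task` is given:

```shell
# Match the thread count to the CPUs Slurm allocated; fall back to 1
# when the variable is unset (for example, outside a job).
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Using ${OMP_NUM_THREADS} threads"
```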

Longer jobs (`--qos=long`)

The default QOS (normal) has a shorter wall-time ceiling than long. For extended training runs, request the long QOS:

sbatch --qos=long --time=72:00:00 -p gpu -A <your_account> gpu_job.sh
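
To see the actual ceilings, `sacctmgr show qos format=Name,MaxWall` lists each QOS with its wall-time limit. The sketch below is a hypothetical helper (not part of Slurm) that converts `[D-]HH:MM:SS` strings to seconds so a requested `--time` can be compared against a `MaxWall` value. It assumes that one format only, although Slurm accepts others, and the limit used here is illustrative rather than Bodhi's real setting:

```shell
# Convert a Slurm time string in [D-]HH:MM:SS form to seconds.
slurm_time_to_seconds() {
    local t="$1" days=0 h m s
    if [[ "$t" == *-* ]]; then
        days="${t%%-*}"   # part before the dash
        t="${t#*-}"       # HH:MM:SS remainder
    fi
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10 so values like "08" are not parsed as octal
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

requested=$(slurm_time_to_seconds "72:00:00")
limit=$(slurm_time_to_seconds "3-00:00:00")   # illustrative limit, not Bodhi's actual MaxWall
if [ "$requested" -le "$limit" ]; then
    echo "request fits"
else
    echo "request exceeds limit"
fi
```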

Requesting more than one GPU

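Two GPUs on a single node. Command-line options override the #SBATCH lines in gpu_job.sh. Set --cpus-per-task explicitly: left to DefCpuPerGPU, a 2-GPU request would default to 32 CPUs (16 per GPU), which is over the 16-CPU-per-node cap listed under Partition settings.

sbatch -p gpu -A <your_account> --gres=gpu:2 --cpus-per-task=16 gpu_job.sh
# 2 GPUs + 16 CPUs, within the per-node CPU cap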

All GPUs on Bodhi are currently NVIDIA A30s, so --gres=gpu:N is sufficient — there is no need to name a specific model.

Limits to keep in mind

CPU cap per job per node is 16. A multi-GPU job that wants more total CPUs must spread across both gpu nodes (--nodes=2 --ntasks-per-node=...) or stay within the 16-CPU-per-node cap.
Memory default is low (12 GB). Always set --mem explicitly for real workloads.
Per-account GPU caps exist. Some accounts are limited to a shared pool of GPUs (e.g., 1 concurrent GPU for the whole account). If your job is stuck in PENDING with reason QOSGrpGRES or AssocGrpGRES, other jobs on your account (possibly your own) are already holding the group's GPUs.
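
The interaction between the per-GPU CPU default and the per-node CPU cap is easy to get wrong. The arithmetic can be sketched as follows, using the two values from the Partition settings table as assumptions; adjust them if the live configuration differs:

```shell
DEF_CPUS_PER_GPU=16    # "Default CPUs per GPU" from the Partition settings table
MAX_CPUS_PER_NODE=16   # "Max CPUs per node per job" from the same table

# CPUs a single-node job receives by default for a given GPU count.
default_cpus_for_gpus() {
    echo $(( $1 * DEF_CPUS_PER_GPU ))
}

for gpus in 1 2 4; do
    cpus=$(default_cpus_for_gpus "$gpus")
    if [ "$cpus" -le "$MAX_CPUS_PER_NODE" ]; then
        echo "${gpus} GPU(s): ${cpus} CPUs by default, within the cap"
    else
        echo "${gpus} GPU(s): ${cpus} CPUs by default, over the cap; set --cpus-per-task"
    fi
done
```

With the values on this page, only the 1-GPU case stays within the cap by default, so multi-GPU jobs on one node should set `--cpus-per-task` explicitly.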

Requesting access

GPU access is granted through a Slurm account. If you don't yet have one:

Contact an administrator to request access. Provide your username and a short description of the workload.
The admin will add you to an existing GPU account (e.g., gpu_rbi) or create a new one for your group.
Once added, pass -A <account_name> on every GPU submission.

You can list the accounts you belong to with:

sacctmgr show assoc user=$USER format=User,Account,DefaultAccount,Partition
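
For use in scripts, the same query can be made machine-readable. A sketch, assuming sacctmgr's `-n` (no header) and `-P` (parsable) flags; `has_account` is a hypothetical helper and `gpu_rbi` is only the example account named above:

```shell
# Succeed if the account named in $1 appears as a whole line on stdin.
has_account() {
    grep -Fxq -- "$1"
}

# On a login node:
#   sacctmgr -n -P show assoc user="$USER" format=Account | has_account gpu_rbi \
#       && echo "gpu_rbi is available" || echo "not a member of gpu_rbi"
```

Matching whole lines avoids false positives from accounts that merely share a prefix.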

Admins: see GPU partition configuration for provisioning details.

Monitoring your GPU jobs

# Your pending/running jobs
squeue -u $USER

# All jobs currently charging a given account
squeue -A <account_name>

# Historical usage with allocated GPUs
sacct -u $USER -X --format=JobID,JobName,Partition,Account,AllocTRES%40,State,Elapsed

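# Live GPU utilization on the node your job landed on
# (--overlap lets this step share the running job's resources; needed on Slurm 20.11 and later)
srun --jobid=<jobid> --overlap --pty nvidia-smi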

After a job ends, seff <jobid> summarizes CPU and memory efficiency (GPU efficiency is not reported there — use sacct with AllocTRES and your own training-time metrics).
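
Since seff does not cover GPUs, one option is to log utilization from inside the job and summarize it afterwards. A sketch, assuming nvidia-smi's CSV query output with `noheader,nounits`; the log path and the `mean_gpu_util` helper are hypothetical:

```shell
# In the batch script, before the training command (samples every 30 s):
#   nvidia-smi --query-gpu=timestamp,index,utilization.gpu,memory.used \
#       --format=csv,noheader,nounits -l 30 > "logs/gpu_util.${SLURM_JOB_ID}.csv" &
#   monitor_pid=$!
#   python train.py --data /data/input --out /data/output
#   kill "$monitor_pid"

# Average the utilization column (third field) of such a log read from stdin.
mean_gpu_util() {
    awk -F', *' '{ sum += $3; n++ } END { if (n) printf "%.1f\n", sum / n }'
}

# Afterwards:
#   mean_gpu_util < logs/gpu_util.<jobid>.csv
```

For multi-GPU jobs the average mixes all devices; filter on the index column first if you need per-GPU figures.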

Troubleshooting

Symptom	Likely cause
`Invalid account or account/partition combination`	Your account is not on `gpu`'s `AllowAccounts` list — ask an admin
Job stuck `PENDING`, reason `QOSGrpGRES`	Another job on your account is holding the group's GPU quota
Job stuck `PENDING`, reason `Resources`	No GPU currently free — wait or request fewer
`Requested node configuration is not available`	You asked for more CPUs than `MaxCPUsPerNode=16` on one node, or an unknown GPU model
`nvidia-smi` shows no GPU	You forgot `--gres=gpu:N` — the `gpu` partition does not auto-allocate GPUs
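
The pending-reason rows of this table can be turned into a quick lookup. A sketch only: `explain_reason` is a hypothetical helper, and its hints are paraphrased from the table rather than taken from Slurm itself:

```shell
# Map a Slurm pending reason to the hint given in the table above.
explain_reason() {
    case "$1" in
        QOSGrpGRES|AssocGrpGRES) echo "account GPU quota is in use" ;;
        Resources)               echo "no GPU currently free" ;;
        *)                       echo "not covered on this page" ;;
    esac
}

# On a login node, list your pending jobs with their reasons:
#   squeue -u "$USER" -t PENDING -h -o "%i %r"
# and pass the second column to explain_reason.
```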