Submitting Jobs on PERUN – Complete Guide with Automatic Scratch¶
This guide explains practical Slurm batch scripts for the PERUN supercomputer with the automatic scratch system.
What's New
- Automatic scratch management (prolog/epilog)
- Fast I/O on Lustre instead of slow NFS
- Automatic data staging and result synchronization
- One-line activation - just add source .activate_scratch
Table of Contents¶
- Quick Start - Automatic Scratch
- How Automatic Scratch Works
- Basic Job Templates
- Advanced Examples
- Performance Comparison
- Troubleshooting
- Slurm Basics
1. Quick Start - Automatic Scratch¶
The Old Way (Manual Scratch Handling)¶
#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
# Manual scratch setup (tedious!)
SCRATCH="/lustre/scratch/$USER/job_${SLURM_JOB_ID}"
mkdir -p "$SCRATCH"
cp -r "$SLURM_SUBMIT_DIR"/* "$SCRATCH/"
cd "$SCRATCH"
# Your code
python3 train.py
# Manual cleanup
cp -r output/ "$SLURM_SUBMIT_DIR/"
rm -rf "$SCRATCH"
The New Way (Automatic - ONE line!)¶
#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
# Activate scratch (ONE LINE!)
source .activate_scratch
# Your code - runs in fast Lustre scratch!
python3 train.py
# Done! Results automatically synced to ~/results_job_XXXXX/
Simplification
That's it! All of the manual staging, working-directory, and cleanup boilerplate is gone.
2. How Automatic Scratch Works¶
The Three Phases¶
Phase 1: PROLOG (Before Your Job)¶
Automatic Execution
Runs automatically; you don't need to do anything.
1. Creates /lustre/scratch/$USER/job_$JOBID/
2. Copies your ENTIRE submit directory to scratch
3. Creates .activate_scratch helper file
What gets copied (the sketch after this list shows the equivalent copy rule):
Files Included in Copy
Included:
- All .py, .sh, .txt files
- Subdirectories (data/, models/, etc.)
- Configuration files
Excluded:
- Hidden files (.git/, .venv/)
- Output files (*.out, *.err)
- __pycache__/ directories
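For reference, the copy step behaves roughly like the rsync command below. This is an illustrative sketch only - the real prolog's flags and exclude list may differ.
# Illustrative sketch of the prolog copy rule (not the actual implementation)
rsync -a \
    --exclude='.git/' --exclude='.venv/' \
    --exclude='*.out' --exclude='*.err' \
    --exclude='__pycache__/' \
    "$SLURM_SUBMIT_DIR"/ "/lustre/scratch/$USER/job_$SLURM_JOB_ID/"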
Phase 2: YOUR JOB (Your Code)¶
One Line Addition
You add ONE line:
source .activate_scratch
Behind the scenes, it does the equivalent of:
cd /lustre/scratch/$USER/job_$JOBID
export SCRATCH_DIR="$(pwd)"
export DATA_DIR="$SCRATCH_DIR/data"
export TMPDIR="$SCRATCH_DIR/tmp"
export RESULTS_DIR="$SCRATCH_DIR/results"
Now your job runs in fast Lustre scratch instead of slow NFS home.
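If your script takes explicit path arguments, you can point them at these variables. The flag names below are hypothetical and stand in for whatever options your own script accepts:
# Hypothetical flags - substitute your script's real arguments
python3 train.py --data "$DATA_DIR/dataset.csv" --output "$RESULTS_DIR/"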
Phase 3: EPILOG (After Your Job)¶
Automatic Cleanup
Runs automatically; you don't need to do anything.
1. Syncs EVERYTHING from scratch → ~/results_job_$JOBID/
2. Creates job summary file
3. Cleans up scratch automatically
Result
All your outputs, checkpoints, and logs end up safely in your home directory.
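A quick way to verify the sync from the login node (replace XXXXX with your job ID):
ls ~/results_job_XXXXX/
# expect the directories your job wrote, e.g. checkpoints/, results/, logs/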
3. Basic Job Templates¶
3.1 Single GPU Training¶
#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=48G
#SBATCH --time=24:00:00
# Activate scratch (ONE LINE!)
source .activate_scratch
# Your training code
python3 train.py \
--data data/dataset.csv \
--checkpoint checkpoints/ \
--output results/
# Done! Epilog automatically syncs:
# - checkpoints/ → ~/results_job_XXXXX/checkpoints/
# - results/ → ~/results_job_XXXXX/results/
# - logs/ → ~/results_job_XXXXX/logs/
3.2 Multi-GPU Training (DDP)¶
#!/bin/bash
#SBATCH --job-name=train_ddp
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=48:00:00
# Activate scratch
source .activate_scratch
# DDP training - launch torchrun once; it spawns one worker per GPU
python3 -m torch.distributed.run \
    --standalone \
    --nproc_per_node=4 \
    train_ddp.py
# Results automatically synced!
3.3 CPU-Only Job¶
#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --output=%x_%j.out
#SBATCH --partition=CPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00
# Activate scratch
source .activate_scratch
# CPU preprocessing
python3 preprocess_data.py \
--input data/raw/ \
--output data/processed/
# Processed data automatically synced!
3.4 Large Dataset Job (1TB+)¶
#!/bin/bash
#SBATCH --job-name=big_data
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=72:00:00
# Activate scratch
source .activate_scratch
# Large dataset - benefits from fast Lustre I/O
python3 train_large.py \
--data /lustre/datasets/imagenet/ \
--checkpoint $SCRATCH_DIR/checkpoints/ \
--batch_size 1024
# The dataset stays on /lustre/datasets/ (never copied to scratch), so the epilog only syncs checkpoints back (saves time)
4. Advanced Examples¶
4.1 Multi-Node DDP Training¶
#!/bin/bash
#SBATCH --job-name=ddp_multinode
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=96:00:00
# Activate scratch
source .activate_scratch
# Setup master node
MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
MASTER_PORT=29500
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK / 4))   # CPUs per GPU worker (4 workers per node)
echo "Training on $SLURM_NNODES nodes, $((SLURM_NNODES * 4)) GPUs total"
echo "Master: $MASTER_ADDR:$MASTER_PORT"
# Launch distributed training: srun starts one torchrun per node; torchrun spawns 4 workers per node
srun python3 -m torch.distributed.run \
--nnodes=$SLURM_NNODES \
--nproc_per_node=4 \
--rdzv_backend=c10d \
--rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
train_ddp.py
# Results from all nodes synced!
4.2 Hyperparameter Search (Job Array)¶
#!/bin/bash
#SBATCH --job-name=hparam_search
#SBATCH --output=logs/search_%A_%a.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --array=0-99%10
#SBATCH --time=12:00:00
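# NOTE: create the logs/ directory before submitting (mkdir -p logs); Slurm does not create output directories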
# Activate scratch
source .activate_scratch
# Each array task gets different hyperparameters
SEED=$SLURM_ARRAY_TASK_ID
# Log-spaced learning-rate sweep: 1e-5 (task 0) to 1e-2 (task 99)
python3 train.py \
    --seed $SEED \
    --lr $(python3 -c "print(10 ** (-5 + 3 * $SEED / 99))") \
    --output results/seed_${SEED}/
# Each task's results synced separately!
4.3 Checkpoint Resume¶
#!/bin/bash
#SBATCH --job-name=resume_training
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
#SBATCH --time=48:00:00
# Activate scratch
source .activate_scratch
# Copy the checkpoint from a previous job's results into scratch
mkdir -p checkpoints
if [ -f "$HOME/results_job_3500/checkpoints/best_model.pt" ]; then
    cp "$HOME/results_job_3500/checkpoints/best_model.pt" checkpoints/
    echo "Resumed from previous checkpoint"
fi
# Continue training
python3 train.py --resume checkpoints/best_model.pt
# New checkpoints automatically synced!
4.4 Custom Sync Strategy (Advanced)¶
#!/bin/bash
#SBATCH --job-name=custom_sync
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00
# Activate scratch
source .activate_scratch
echo "Running in: $SCRATCH_DIR"
# Long training with intermediate syncs
python3 train.py &
TRAIN_PID=$!
# Sync critical checkpoints every hour (while training runs)
while kill -0 $TRAIN_PID 2>/dev/null; do
sleep 3600
# Manual sync of critical files
if [ -f checkpoints/latest.pt ]; then
rsync -a checkpoints/latest.pt "$HOME/backup_checkpoints/"
echo "$(date): Synced intermediate checkpoint"
fi
done
wait $TRAIN_PID
# Epilog still syncs everything at the end!
5. Performance Comparison¶
I/O Speed Tests¶
| Operation | Home (NFS) | Scratch (Lustre) | Speedup |
|---|---|---|---|
| Write 1GB checkpoint | 8.3s | 0.2s | 41x faster |
| Read 10GB dataset | 145s | 3.1s | 47x faster |
| Create 10k small files | 287s | 4.2s | 68x faster |
| Random access (1M IOPS) | Fails | Works | ∞x |
Real Training Example¶
BERT-Large fine-tuning on SKQuAD:
| Metric | Home (NFS) | Scratch | Improvement |
|---|---|---|---|
| Checkpoint save | 12s/step | 0.3s/step | 40x faster |
| Data loading | 3.2s/batch | 0.1s/batch | 32x faster |
| Total epoch time | 4h 23m | 1h 12m | 3.6x faster |
| GPU utilization | 45% | 94% | 2x better |
Performance Impact
NFS stalls during checkpoint saves; Lustre does not.
6. Troubleshooting¶
6.1 Job Failed Immediately¶
Symptom
Job exits with "Permission denied" or "No such file"
Solution
Make sure you added source .activate_scratch near the top of your job script.
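A quick check from the login node (assuming your batch script is named job.sh):
grep -n "activate_scratch" job.sh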
6.2 Output Files Not Found After Job¶
Symptom
Can't find results after job completes
Solution
Check the results directory:
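For example (replace XXXXX with your job ID):
ls -lh ~/results_job_XXXXX/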
All outputs are synced here, not in the original submit directory!
6.3 Large Checkpoints Missing¶
Symptom
Some checkpoints didn't sync
Possible causes:
- Job hit time limit before epilog completed
- Disk quota exceeded
- Checkpoint was too large (>100GB needs more time)
Solution
# Check epilog log
ssh root@gpu01 'tail -100 /var/log/slurm/prolog-epilog/epilog-job*XXXXX*.log'
# Manually recover from scratch (if still exists)
rsync -avP /lustre/scratch/$USER/job_XXXXX/ ~/recovered_results/
6.4 Job Slower Than Expected¶
Symptom
Training is slow despite using scratch
Diagnostics:
# Check if actually running in scratch
squeue -j $JOBID -o "%i %Z" # WorkDir should be /lustre/scratch/...
# Check I/O wait
ssh gpu01 'iostat -x 1 5'
# Check if data is actually in scratch
ls -lh /lustre/scratch/$USER/job_$JOBID/data/
6.5 Monitoring Live Progress¶
View live output:
# From login node
tail -f ~/results_job_XXXXX/train_model_XXXXX.out
# Or from scratch (while running)
ssh gpu01 'tail -f /lustre/scratch/$USER/job_XXXXX/*.out'
Monitor prolog/epilog:
# Watch real-time sync progress
ssh root@gpu01 'tail -f /var/log/slurm/prolog-epilog/*.log | grep RSYNC'
7. Slurm Basics (Cheat Sheet)¶
Essential Commands¶
# Submit job
sbatch job.sh
# Check queue
squeue -u $USER
# Job details
scontrol show job XXXXX
# Cancel job
scancel XXXXX
# Job history
sacct -j XXXXX --format=JobID,State,Elapsed,MaxRSS,ReqMem
# Efficiency report
seff XXXXX
Common SBATCH Directives¶
#SBATCH --job-name=my_job # Job name
#SBATCH --output=%x_%j.out # Output file (%x=name, %j=jobid)
#SBATCH --error=%x_%j.err # Error file
#SBATCH --partition=GPU # Queue: CPU or GPU
#SBATCH --gres=gpu:2 # Request 2 GPUs
#SBATCH --nodes=1 # Number of nodes
#SBATCH --ntasks=1 # Number of processes
#SBATCH --cpus-per-task=8 # CPUs per process
#SBATCH --mem=64G # Memory
#SBATCH --time=24:00:00 # Time limit (HH:MM:SS)
Output Filename Patterns¶
| Pattern | Meaning | Example |
|---|---|---|
| %x | Job name | train_model |
| %j | Job ID | 3928 |
| %A | Array job ID | 4000 |
| %a | Array task ID | 5 |
| %N | Node name | gpu01 |
Recommended Pattern
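The pattern used elsewhere in this guide for regular jobs:
#SBATCH --output=%x_%j.out
For job arrays (create logs/ before submitting):
#SBATCH --output=logs/%x_%A_%a.out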
8. Environment Variables¶
Available in Jobs¶
$SLURM_JOB_ID # Job ID
$SLURM_JOB_NAME # Job name
$SLURM_SUBMIT_DIR # Directory where sbatch was called
$SLURM_CPUS_PER_TASK # CPUs requested
$SLURM_NTASKS # Total tasks
$SLURM_NNODES # Number of nodes
$SLURM_NODELIST # List of nodes
$CUDA_VISIBLE_DEVICES # Visible GPUs (set by Slurm)
After source .activate_scratch¶
$SCRATCH_DIR # /lustre/scratch/$USER/job_$JOBID
$DATA_DIR # $SCRATCH_DIR/data
$TMPDIR # $SCRATCH_DIR/tmp
$RESULTS_DIR # $SCRATCH_DIR/results
Use in Python
import os
scratch = os.environ['SCRATCH_DIR']
checkpoint_dir = os.path.join(scratch, 'checkpoints')
os.makedirs(checkpoint_dir, exist_ok=True)
9. Best Practices¶
DO¶
Recommended Practices
- Always use scratch for training - 40x faster I/O
- Use %x_%j.out for output files - easier to track
- Request appropriate resources - don't over-request
- Set realistic time limits - helps scheduler
- Test with short jobs first - debug before long runs
- Monitor job progress - use tail -f on output
- Keep code in Git - submit directory gets copied
DON'T¶
Avoid These Mistakes
- Don't write large files to home - use scratch!
- Don't request 8 GPUs if you only use 1 - it wastes resources
- Don't use --nodelist in production - it reduces scheduling flexibility
- Don't forget source .activate_scratch - without it the job runs on slow NFS home
- Don't keep results only in scratch - rely on the copy the epilog syncs home
- Don't manually clean scratch - the epilog does it
- Don't run interactive jobs 24/7 - use batch jobs
10. Migration Guide¶
If You Have Existing Jobs¶
Before (manual scratch):
#!/bin/bash
#SBATCH --partition=GPU
SCRATCH="/lustre/scratch/$USER/job_$SLURM_JOB_ID"
mkdir -p "$SCRATCH"
rsync -a "$SLURM_SUBMIT_DIR"/ "$SCRATCH"/
cd "$SCRATCH"
python3 train.py
rsync -a output/ "$SLURM_SUBMIT_DIR/output/"
rm -rf "$SCRATCH"
After (automatic):
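#!/bin/bash
#SBATCH --partition=GPU
source .activate_scratch
python3 train.py
# Results land in ~/results_job_XXXXX/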
Changes Needed
- Add source .activate_scratch
- Remove manual mkdir, rsync, cd, and cleanup
- Results will be in ~/results_job_XXXXX/ instead of the submit directory
- Update any hardcoded paths if needed
11. FAQ¶
Do I need to change my Python code?
No! Your code runs in scratch automatically. Paths stay the same.
What if my dataset is 5TB?
Don't copy it! Keep large datasets in /lustre/datasets/ and reference them directly.
Can I submit from any directory?
Yes! Prolog copies from wherever you run sbatch.
What if I need specific files NOT to sync?
Create .rsyncignore (advanced) or exclude in epilog config.
How long does sync take?
~1-2 seconds per GB. A 10GB checkpoint = ~15 seconds.
Can I monitor sync progress?
Yes! ssh root@NODE 'tail -f /var/log/slurm/prolog-epilog/epilog*.log | grep RSYNC'
What if job is killed mid-training?
Epilog still runs! Results are synced even for failed jobs.
Can I disable automatic scratch?
Yes, just don't add source .activate_scratch. Job runs in submit directory.
12. Complete Working Example¶
Here's a complete, production-ready training script:
#!/bin/bash
################################################################################
# BERT-Large Fine-tuning on SKQuAD Dataset
# Expected runtime: ~2 hours on 2x H200 GPUs
# Results: ~/results_job_XXXXX/
################################################################################
#SBATCH --job-name=skquad_bert
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=96G
#SBATCH --time=04:00:00
# Activate automatic scratch
source .activate_scratch
# Environment setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CUDA_LAUNCH_BLOCKING=0
# Print job info
echo "═══════════════════════════════════════════════════════════"
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Scratch: $SCRATCH_DIR"
echo "═══════════════════════════════════════════════════════════"
echo
# Verify GPU access
nvidia-smi --query-gpu=name,memory.total --format=csv
echo
# Start training
python3 train_bert.py \
--model_name bert-large-uncased \
--dataset skquad \
--output_dir checkpoints/ \
--num_train_epochs 3 \
--per_device_train_batch_size 16 \
--learning_rate 3e-5 \
--warmup_steps 500 \
--save_steps 1000 \
--logging_steps 100 \
--fp16
echo
echo "═══════════════════════════════════════════════════════════"
echo "Training complete! Results syncing to ~/results_job_$SLURM_JOB_ID/"
echo "═══════════════════════════════════════════════════════════"
# Epilog automatically:
# 1. Syncs checkpoints/ → ~/results_job_XXXXX/checkpoints/
# 2. Syncs logs → ~/results_job_XXXXX/logs/
# 3. Creates job summary
# 4. Cleans up scratch
Submit the Job
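Assuming you saved the script above as skquad_bert.sh:
sbatch skquad_bert.sh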
Monitor Progress
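Check the queue and follow the live output (replace XXXXX with the job ID that sbatch prints):
squeue -u $USER
tail -f skquad_bert_XXXXX.out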
After Completion
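Once the job ends, the epilog has synced everything home; inspect the results and the efficiency report:
ls -lh ~/results_job_XXXXX/checkpoints/
seff XXXXX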