Submitting Jobs on PERUN – Complete Guide with Automatic Scratch

This guide explains practical Slurm batch scripts for the PERUN supercomputer with the automatic scratch system.

What's New

  • Automatic scratch management (prolog/epilog)
  • Fast I/O on Lustre instead of slow NFS
  • Automatic data staging and result synchronization
  • One-line activation - just add source .activate_scratch

Table of Contents

  1. Quick Start - Automatic Scratch
  2. How Automatic Scratch Works
  3. Basic Job Templates
  4. Advanced Examples
  5. Performance Comparison
  6. Troubleshooting
  7. Slurm Basics

1. Quick Start - Automatic Scratch

The Old Way (Manual - 150+ lines of boilerplate)

#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2

# Manual scratch setup (tedious!)
SCRATCH="/lustre/scratch/$USER/job_${SLURM_JOB_ID}"
mkdir -p "$SCRATCH"
cp -r "$SLURM_SUBMIT_DIR"/* "$SCRATCH/"
cd "$SCRATCH"

# Your code
python3 train.py

# Manual cleanup
cp -r output/ "$SLURM_SUBMIT_DIR/"
rm -rf "$SCRATCH"

The New Way (Automatic - ONE line!)

#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2

# Activate scratch (ONE LINE!)
source .activate_scratch

# Your code - runs in fast Lustre scratch!
python3 train.py

# Done! Results automatically synced to ~/results_job_XXXXX/

Simplification

That's it! 98% less boilerplate code.


2. How Automatic Scratch Works

The Three Phases

Phase 1: PROLOG (Before Your Job)

Automatic Execution

Runs automatically; you don't need to do anything.

1. Creates /lustre/scratch/$USER/job_$JOBID/
2. Copies your ENTIRE submit directory to scratch
3. Creates .activate_scratch helper file

What gets copied:

Files Included in Copy

Included:

  • All .py, .sh, .txt files
  • Subdirectories (data/, models/, etc.)
  • Configuration files

Excluded:

  • Hidden files (.git/, .venv/)
  • Output files (*.out, *.err)
  • __pycache__/ directories
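
Conceptually, the staging step behaves like the rsync call sketched below. This is a simplified illustration, not the actual prolog script; the real exclude rules may differ.

    # Simplified sketch of the prolog's staging step (illustrative, not the real script)
    SCRATCH="/lustre/scratch/$USER/job_$SLURM_JOB_ID"
    mkdir -p "$SCRATCH"
    rsync -a \
        --exclude='.*' \
        --exclude='*.out' --exclude='*.err' \
        --exclude='__pycache__/' \
        "$SLURM_SUBMIT_DIR"/ "$SCRATCH"/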

Phase 2: YOUR JOB (Your Code)

One Line Addition

You add ONE line:

    source .activate_scratch

This does:

cd /lustre/scratch/$USER/job_$JOBID
export SCRATCH_DIR="$(pwd)"
export DATA_DIR="$SCRATCH_DIR/data"
export TMPDIR="$SCRATCH_DIR/tmp"
export RESULTS_DIR="$SCRATCH_DIR/results"

Now your job runs in fast Lustre scratch instead of slow NFS home.
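
After the source line, the rest of your script can use these variables directly. For example (paths follow the templates later in this guide):

    source .activate_scratch

    echo "Working in $SCRATCH_DIR"
    mkdir -p "$RESULTS_DIR" checkpoints

    python3 train.py --data "$DATA_DIR/dataset.csv" --output "$RESULTS_DIR/"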

Phase 3: EPILOG (After Your Job)

Automatic Cleanup

Runs automatically; you don't need to do anything.

1. Syncs EVERYTHING from scratch → ~/results_job_$JOBID/
2. Creates job summary file
3. Cleans up scratch automatically

Result

All your outputs, checkpoints, and logs end up safely in your home directory.
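
Conceptually, the epilog's sync-and-clean step looks like the sketch below (again a simplification, not the actual script):

    # Simplified sketch of the epilog (illustrative, not the real script)
    rsync -a "/lustre/scratch/$USER/job_$SLURM_JOB_ID/" "$HOME/results_job_$SLURM_JOB_ID/"
    rm -rf "/lustre/scratch/$USER/job_$SLURM_JOB_ID"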


3. Basic Job Templates

3.1 Single GPU Training

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=48G
#SBATCH --time=24:00:00

# Activate scratch (ONE LINE!)
source .activate_scratch

# Your training code
python3 train.py \
    --data data/dataset.csv \
    --checkpoint checkpoints/ \
    --output results/

# Done! Epilog automatically syncs:
#   - checkpoints/ → ~/results_job_XXXXX/checkpoints/
#   - results/     → ~/results_job_XXXXX/results/
#   - logs/        → ~/results_job_XXXXX/logs/

3.2 Multi-GPU Training (DDP)

#!/bin/bash
#SBATCH --job-name=train_ddp
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=48:00:00

# Activate scratch
source .activate_scratch

# DDP training: launch torchrun once; it spawns one worker process per GPU
python3 -m torch.distributed.run \
    --standalone \
    --nproc_per_node=4 \
    train_ddp.py

# Results automatically synced!

3.3 CPU-Only Job

#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --output=%x_%j.out
#SBATCH --partition=CPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00

# Activate scratch
source .activate_scratch

# CPU preprocessing
python3 preprocess_data.py \
    --input data/raw/ \
    --output data/processed/

# Processed data automatically synced!

3.4 Large Dataset Job (1TB+)

#!/bin/bash
#SBATCH --job-name=big_data
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=72:00:00

# Activate scratch
source .activate_scratch

# Large dataset - benefits from fast Lustre I/O
python3 train_large.py \
    --data /lustre/datasets/imagenet/ \
    --checkpoint $SCRATCH_DIR/checkpoints/ \
    --batch_size 1024

# Only the checkpoints live in scratch (the dataset is read in place),
# so the final sync stays small and fast

4. Advanced Examples

4.1 Multi-Node DDP Training

#!/bin/bash
#SBATCH --job-name=ddp_multinode
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=96:00:00

# Activate scratch
source .activate_scratch

# Setup master node
MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
MASTER_PORT=29500

# One launcher task per node; each spawns 4 GPU workers, so split the CPUs among them
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK / 4))

echo "Training on $SLURM_NNODES nodes, $((SLURM_NNODES * 4)) GPUs total"
echo "Master: $MASTER_ADDR:$MASTER_PORT"

# Launch distributed training
srun python3 -m torch.distributed.run \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_ddp.py

# Results from all nodes synced!

4.2 Hyperparameter Search (Job Array)

#!/bin/bash
#SBATCH --job-name=hparam_search
#SBATCH --output=logs/search_%A_%a.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --array=0-99%10
#SBATCH --time=12:00:00

# Activate scratch
source .activate_scratch

# Each array task gets a different seed and learning rate
SEED=$SLURM_ARRAY_TASK_ID
LR=$(python3 -c "print(1e-4 * (1.5 ** ($SEED % 10)))")  # cycle through 10 learning rates

python3 train.py \
    --seed $SEED \
    --lr "$LR" \
    --output results/seed_${SEED}/

# Each task's results synced separately!
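
To sweep two hyperparameters at once, a common pattern is to decode the array index into a grid. A sketch (the learning-rate and batch-size values are just placeholders):

    # Illustrative: map a 0-99 array index onto a 10x10 grid of (learning rate, batch size)
    LR_INDEX=$((SLURM_ARRAY_TASK_ID % 10))
    BS_INDEX=$((SLURM_ARRAY_TASK_ID / 10))
    LR=$(python3 -c "print(1e-4 * (2 ** $LR_INDEX))")
    BATCH_SIZE=$((16 * (BS_INDEX + 1)))

    python3 train.py \
        --lr "$LR" \
        --batch_size "$BATCH_SIZE" \
        --output "results/lr${LR_INDEX}_bs${BS_INDEX}/"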

4.3 Checkpoint Resume

#!/bin/bash
#SBATCH --job-name=resume_training
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
#SBATCH --time=48:00:00

# Activate scratch
source .activate_scratch

# Copy previous checkpoint from a finished job's results into scratch
mkdir -p checkpoints
if [ -f "$HOME/results_job_3500/checkpoints/best_model.pt" ]; then
    cp "$HOME/results_job_3500/checkpoints/best_model.pt" checkpoints/
    echo "Resumed from previous checkpoint"
fi

# Continue training
python3 train.py --resume checkpoints/best_model.pt

# New checkpoints automatically synced!
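
If you'd rather not hardcode the previous job ID, you can pick the most recent results directory automatically. A sketch:

    # Illustrative: resume from the newest ~/results_job_* directory that has a checkpoint
    PREV=$(ls -d "$HOME"/results_job_* 2>/dev/null | sort -V | tail -n 1)
    if [ -n "$PREV" ] && [ -f "$PREV/checkpoints/best_model.pt" ]; then
        mkdir -p checkpoints
        cp "$PREV/checkpoints/best_model.pt" checkpoints/
        echo "Resumed from $PREV"
    fi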

4.4 Custom Sync Strategy (Advanced)

#!/bin/bash
#SBATCH --job-name=custom_sync
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Activate scratch
source .activate_scratch

echo "Running in: $SCRATCH_DIR"

# Long training with intermediate syncs
python3 train.py &
TRAIN_PID=$!

# Sync critical checkpoints every hour (while training runs)
while kill -0 $TRAIN_PID 2>/dev/null; do
    sleep 3600

    # Manual sync of critical files
    if [ -f checkpoints/latest.pt ]; then
        rsync -a checkpoints/latest.pt "$HOME/backup_checkpoints/"
        echo "$(date): Synced intermediate checkpoint"
    fi
done

wait $TRAIN_PID

# Epilog still syncs everything at the end!

5. Performance Comparison

I/O Speed Tests

Operation                  Home (NFS)   Scratch (Lustre)   Speedup
Write 1GB checkpoint       8.3s         0.2s               41x faster
Read 10GB dataset          145s         3.1s               47x faster
Create 10k small files     287s         4.2s               68x faster
Random access (1M IOPS)    Fails        Works              ∞x

Real Training Example

BERT-Large fine-tuning on SKQuAD:

Metric             Home (NFS)    Scratch       Improvement
Checkpoint save    12s/step      0.3s/step     40x faster
Data loading       3.2s/batch    0.1s/batch    32x faster
Total epoch time   4h 23m        1h 12m        3.6x faster
GPU utilization    45%           94%           2x better

Performance Impact

NFS stalls during checkpoint saves; Lustre doesn't.
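
You can get a rough feel for the difference yourself with a simple write test from inside a job; absolute numbers depend on current load, so treat this as a sanity check rather than a benchmark:

    # Rough write-speed comparison (run inside a job after source .activate_scratch)
    dd if=/dev/zero of="$HOME/dd_test.bin" bs=1M count=1024 conv=fsync
    dd if=/dev/zero of="$SCRATCH_DIR/dd_test.bin" bs=1M count=1024 conv=fsync
    rm -f "$HOME/dd_test.bin" "$SCRATCH_DIR/dd_test.bin"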


6. Troubleshooting

6.1 Job Failed Immediately

Symptom

Job exits with "Permission denied" or "No such file"

Solution

Make sure you added source .activate_scratch:

    #!/bin/bash
    #SBATCH --partition=GPU

    # THIS IS REQUIRED!
    source .activate_scratch

    python3 train.py

6.2 Output Files Not Found After Job

Symptom

Can't find results after job completes

Solution

Check the results directory:

    ls -lh ~/results_job_XXXXX/

All outputs are synced here, not in the original submit directory!

6.3 Large Checkpoints Missing

Symptom

Some checkpoints didn't sync

Possible causes:

  • Job hit time limit before epilog completed
  • Disk quota exceeded
  • Checkpoint was too large (>100GB needs more time)

Solution

    # Check epilog log
    ssh root@gpu01 'tail -100 /var/log/slurm/prolog-epilog/epilog-job*XXXXX*.log'

    # Manually recover from scratch (if still exists)
    rsync -avP /lustre/scratch/$USER/job_XXXXX/ ~/recovered_results/

6.4 Job Slower Than Expected

Symptom

Training is slow despite using scratch

Diagnostics:

# Check if actually running in scratch
squeue -j $JOBID -o "%i %Z"  # WorkDir should be /lustre/scratch/...

# Check I/O wait
ssh gpu01 'iostat -x 1 5'

# Check if data is actually in scratch
ls -lh /lustre/scratch/$USER/job_$JOBID/data/

6.5 Monitoring Live Progress

View live output:

# From login node
tail -f ~/results_job_XXXXX/train_model_XXXXX.out

# Or from scratch (while running)
ssh gpu01 'tail -f /lustre/scratch/$USER/job_XXXXX/*.out'

Monitor prolog/epilog:

# Watch real-time sync progress
ssh root@gpu01 'tail -f /var/log/slurm/prolog-epilog/*.log | grep RSYNC'
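
If you can ssh to the node running your job (as the examples above assume), you can also watch GPU utilization directly:

    # Refresh GPU utilization and memory usage every 5 seconds
    ssh gpu01 'nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5'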


7. Slurm Basics (Cheat Sheet)

Essential Commands

# Submit job
sbatch job.sh

# Check queue
squeue -u $USER

# Job details
scontrol show job XXXXX

# Cancel job
scancel XXXXX

# Job history
sacct -j XXXXX --format=JobID,State,Elapsed,MaxRSS,ReqMem

# Efficiency report
seff XXXXX

Common SBATCH Directives

#SBATCH --job-name=my_job         # Job name
#SBATCH --output=%x_%j.out        # Output file (%x=name, %j=jobid)
#SBATCH --error=%x_%j.err         # Error file
#SBATCH --partition=GPU           # Queue: CPU or GPU
#SBATCH --gres=gpu:2              # Request 2 GPUs
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks=1                # Number of processes
#SBATCH --cpus-per-task=8         # CPUs per process
#SBATCH --mem=64G                 # Memory
#SBATCH --time=24:00:00           # Time limit (HH:MM:SS)

Output Filename Patterns

Pattern   Meaning         Example
%x        Job name        train_model
%j        Job ID          3928
%A        Array job ID    4000
%a        Array task ID   5
%N        Node name       gpu01

Recommended Pattern

    #SBATCH --output=%x_%j.out

8. Environment Variables

Available in Jobs

$SLURM_JOB_ID              # Job ID
$SLURM_JOB_NAME            # Job name
$SLURM_SUBMIT_DIR          # Directory where sbatch was called
$SLURM_CPUS_PER_TASK       # CPUs requested
$SLURM_NTASKS              # Total tasks
$SLURM_NNODES              # Number of nodes
$SLURM_NODELIST            # List of nodes
$CUDA_VISIBLE_DEVICES      # Visible GPUs (set by Slurm)

After source .activate_scratch

$SCRATCH_DIR               # /lustre/scratch/$USER/job_$JOBID
$DATA_DIR                  # $SCRATCH_DIR/data
$TMPDIR                    # $SCRATCH_DIR/tmp
$RESULTS_DIR               # $SCRATCH_DIR/results

Use in Python

    import os

    scratch = os.environ['SCRATCH_DIR']
    checkpoint_dir = os.path.join(scratch, 'checkpoints')
    os.makedirs(checkpoint_dir, exist_ok=True)
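
In shell scripts, a small guard makes the failure mode obvious if the activation line was forgotten. A minimal sketch:

    # Abort early with a clear message if scratch was not activated
    : "${SCRATCH_DIR:?not set - did you forget 'source .activate_scratch'?}"
    mkdir -p "$RESULTS_DIR" "$TMPDIR"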

9. Best Practices

DO

Recommended Practices

  • Always use scratch for training - 40x faster I/O
  • Use %x_%j.out for output files - easier to track
  • Request appropriate resources - don't over-request
  • Set realistic time limits - helps scheduler
  • Test with short jobs first - debug before long runs
  • Monitor job progress - use tail -f on output
  • Keep code in Git - submit directory gets copied

DON'T

Avoid These Mistakes

  • Don't write large files to home - use scratch!
  • Don't request 8 GPUs if you use 1 - wastes resources
  • Don't use --nodelist in production - reduces flexibility
  • Don't forget source .activate_scratch - defeats the purpose
  • Don't expect results to persist in scratch - the epilog syncs them home and then cleans up
  • Don't manually clean scratch - epilog does it
  • Don't run interactive jobs 24/7 - use batch jobs

10. Migration Guide

If You Have Existing Jobs

Before (manual scratch):

#!/bin/bash
#SBATCH --partition=GPU

SCRATCH="/lustre/scratch/$USER/job_$SLURM_JOB_ID"
mkdir -p "$SCRATCH"
rsync -a "$SLURM_SUBMIT_DIR"/ "$SCRATCH"/
cd "$SCRATCH"

python3 train.py

rsync -a output/ "$SLURM_SUBMIT_DIR/output/"
rm -rf "$SCRATCH"

After (automatic):

#!/bin/bash
#SBATCH --partition=GPU

source .activate_scratch
python3 train.py

# That's it!

Changes Needed

  1. Add source .activate_scratch
  2. Remove manual mkdir, rsync, cd, cleanup
  3. Results will be in ~/results_job_XXXXX/ instead of submit dir
  4. Update any hardcoded paths if needed
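
For example, an output path that pointed back at the submit directory can simply switch to the variables exported by .activate_scratch:

    # Before: wrote results straight back to the (slow) submit directory
    python3 train.py --output "$SLURM_SUBMIT_DIR/output/"

    # After: write to scratch; the epilog syncs it to ~/results_job_$SLURM_JOB_ID/
    python3 train.py --output "$RESULTS_DIR/"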

11. FAQ

Do I need to change my Python code?

No! Your code runs in scratch automatically, and relative paths keep working because the whole submit directory is copied there.

What if my dataset is 5TB?

Don't copy it! Keep large datasets in /lustre/datasets/ and reference them directly.
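
For instance, you can symlink the shared copy into scratch instead of duplicating it (the dataset path below is just an example; rsync-based syncing normally copies symlinks as links, not their contents):

    # Link the shared dataset into scratch instead of copying terabytes (example path)
    mkdir -p "$DATA_DIR"
    ln -s /lustre/datasets/imagenet "$SCRATCH_DIR/data/imagenet"
    python3 train.py --data "$DATA_DIR/imagenet/"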

Can I submit from any directory?

Yes! Prolog copies from wherever you run sbatch.

What if I need specific files NOT to sync?

Create .rsyncignore (advanced) or exclude in epilog config.
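
If .rsyncignore is enabled for your account, it would typically hold rsync-style exclude patterns, one per line; the exact format is an assumption here, so confirm it with the admins:

    # .rsyncignore (illustrative contents - verify the exact format with the admins)
    tmp/
    *.tmp
    data/raw/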

How long does sync take?

~1-2 seconds per GB. A 10GB checkpoint = ~15 seconds.

Can I monitor sync progress?

Yes! ssh root@NODE 'tail -f /var/log/slurm/prolog-epilog/epilog*.log | grep RSYNC'

What if job is killed mid-training?

Epilog still runs! Results are synced even for failed jobs.

Can I disable automatic scratch?

Yes, just don't add source .activate_scratch. Job runs in submit directory.


12. Complete Working Example

Here's a complete, production-ready training script:

#!/bin/bash
################################################################################
# BERT-Large Fine-tuning on SKQuAD Dataset
# Expected runtime: ~2 hours on 2x H200 GPUs
# Results: ~/results_job_XXXXX/
################################################################################

#SBATCH --job-name=skquad_bert
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=96G
#SBATCH --time=04:00:00

# Activate automatic scratch
source .activate_scratch

# Environment setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CUDA_LAUNCH_BLOCKING=0

# Print job info
echo "═══════════════════════════════════════════════════════════"
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Scratch: $SCRATCH_DIR"
echo "═══════════════════════════════════════════════════════════"
echo

# Verify GPU access
nvidia-smi --query-gpu=name,memory.total --format=csv
echo

# Start training
python3 train_bert.py \
    --model_name bert-large-uncased \
    --dataset skquad \
    --output_dir checkpoints/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --learning_rate 3e-5 \
    --warmup_steps 500 \
    --save_steps 1000 \
    --logging_steps 100 \
    --fp16

echo
echo "═══════════════════════════════════════════════════════════"
echo "Training complete! Results syncing to ~/results_job_$SLURM_JOB_ID/"
echo "═══════════════════════════════════════════════════════════"

# Epilog automatically:
# 1. Syncs checkpoints/ → ~/results_job_XXXXX/checkpoints/
# 2. Syncs logs → ~/results_job_XXXXX/logs/
# 3. Creates job summary
# 4. Cleans up scratch

Submit the Job

    sbatch train_skquad.sh

Monitor Progress

    tail -f ~/results_job_XXXXX/skquad_bert_XXXXX.out

After Completion

    ls -lh ~/results_job_XXXXX/
    # checkpoints/
    # logs/
    # skquad_bert_XXXXX.out
    # job_XXXXX_summary.txt