Submitting Jobs on PERUN – Complete Guide with Automatic Scratch

This guide explains practical Slurm batch scripts for the PERUN supercomputer with the automatic scratch system.

What's New

  • Automatic scratch management (prolog/epilog)
  • Fast I/O on Lustre instead of slow NFS
  • Automatic data staging and result synchronization
  • One-line activation - just add source .activate_scratch

Table of Contents

  1. Quick Start - Automatic Scratch
  2. How Automatic Scratch Works
  3. Basic Job Templates
  4. Advanced Examples
  5. Performance Comparison
  6. Troubleshooting
  7. Slurm Basics

1. Quick Start - Automatic Scratch

The Old Way (Manual - 150+ lines of boilerplate)

#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2

# Manual scratch setup (tedious!)
SCRATCH="/lustre/scratch/$USER/job_${SLURM_JOB_ID}"
mkdir -p "$SCRATCH"
cp -r "$SLURM_SUBMIT_DIR"/* "$SCRATCH/"
cd "$SCRATCH"

# Your code
python3 train.py

# Manual cleanup
cp -r output/ "$SLURM_SUBMIT_DIR/"
rm -rf "$SCRATCH"

The New Way (Automatic - ONE line!)

#!/bin/bash
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2

# Activate scratch (ONE LINE!)
source .activate_scratch

# Your code - runs in fast Lustre scratch!
python3 train.py

# Done! Results automatically synced to ~/results_job_XXXXX/

Simplification

That's it! 98% less boilerplate code.


2. How Automatic Scratch Works

The Three Phases

Phase 1: PROLOG (Before Your Job)

Automatic Execution

Runs automatically; you don't need to do anything.

1. Creates /lustre/scratch/$USER/job_$JOBID/
2. Copies your ENTIRE submit directory to scratch
3. Creates .activate_scratch helper file

What gets copied:

Files Included in Copy

Included:

  • All .py, .sh, .txt files
  • Subdirectories (data/, models/, etc.)
  • Configuration files

Excluded:

  • Hidden files (.git/, .venv/)
  • Output files (*.out, *.err)
  • __pycache__/ directories
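
Conceptually, the staging step behaves like the rsync call sketched below. This is a simplified illustration, not the actual prolog script; the real exclude rules may differ.

    # Simplified sketch of the prolog's staging step (illustrative, not the real script)
    SCRATCH="/lustre/scratch/$USER/job_$SLURM_JOB_ID"
    mkdir -p "$SCRATCH"
    rsync -a \
        --exclude='.*' \
        --exclude='*.out' --exclude='*.err' \
        --exclude='__pycache__/' \
        "$SLURM_SUBMIT_DIR"/ "$SCRATCH"/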

Phase 2: YOUR JOB (Your Code)

One Line Addition

You add ONE line:

    source .activate_scratch

This does:

cd /lustre/scratch/$USER/job_$JOBID
export SCRATCH_DIR="$(pwd)"
export DATA_DIR="$SCRATCH_DIR/data"
export TMPDIR="$SCRATCH_DIR/tmp"
export RESULTS_DIR="$SCRATCH_DIR/results"

Now your job runs in fast Lustre scratch instead of slow NFS home.
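
After the source line, the rest of your script can use these variables directly. For example (paths follow the templates later in this guide):

    source .activate_scratch

    echo "Working in $SCRATCH_DIR"
    mkdir -p "$RESULTS_DIR" checkpoints

    python3 train.py --data "$DATA_DIR/dataset.csv" --output "$RESULTS_DIR/"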

Phase 3: EPILOG (After Your Job)

Automatic Cleanup

Runs automatically; you don't need to do anything.

1. Syncs EVERYTHING from scratch → ~/results_job_$JOBID/
2. Creates job summary file
3. Cleans up scratch automatically

Result

All your outputs, checkpoints, and logs end up safely in your home directory.
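
Conceptually, the epilog's sync-and-clean step looks like the sketch below (again a simplification, not the actual script):

    # Simplified sketch of the epilog (illustrative, not the real script)
    rsync -a "/lustre/scratch/$USER/job_$SLURM_JOB_ID/" "$HOME/results_job_$SLURM_JOB_ID/"
    rm -rf "/lustre/scratch/$USER/job_$SLURM_JOB_ID"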


3. Basic Job Templates

3.1 Single GPU Training

#!/bin/bash
#SBATCH --job-name=train_model
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=48G
#SBATCH --time=24:00:00

# Activate scratch (ONE LINE!)
source .activate_scratch

# Your training code
python3 train.py \
    --data data/dataset.csv \
    --checkpoint checkpoints/ \
    --output results/

# Done! Epilog automatically syncs:
#   - checkpoints/ → ~/results_job_XXXXX/checkpoints/
#   - results/     → ~/results_job_XXXXX/results/
#   - logs/        → ~/results_job_XXXXX/logs/

3.2 Multi-GPU Training (DDP)

#!/bin/bash
#SBATCH --job-name=train_ddp
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:4
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=32
#SBATCH --mem=128G
#SBATCH --time=48:00:00

# Activate scratch
source .activate_scratch

# DDP training: launch torchrun once; it spawns one worker process per GPU
python3 -m torch.distributed.run \
    --standalone \
    --nproc_per_node=4 \
    train_ddp.py

# Results automatically synced!

3.3 CPU-Only Job

#!/bin/bash
#SBATCH --job-name=preprocess
#SBATCH --output=%x_%j.out
#SBATCH --partition=CPU
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=16
#SBATCH --mem=64G
#SBATCH --time=04:00:00

# Activate scratch
source .activate_scratch

# CPU preprocessing
python3 preprocess_data.py \
    --input data/raw/ \
    --output data/processed/

# Processed data automatically synced!

3.4 Large Dataset Job (1TB+)

#!/bin/bash
#SBATCH --job-name=big_data
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:8
#SBATCH --cpus-per-task=16
#SBATCH --mem=256G
#SBATCH --time=72:00:00

# Activate scratch
source .activate_scratch

# Large dataset - benefits from fast Lustre I/O
python3 train_large.py \
    --data /lustre/datasets/imagenet/ \
    --checkpoint $SCRATCH_DIR/checkpoints/ \
    --batch_size 1024

# Only the checkpoints live in scratch (the dataset is read in place),
# so the final sync stays small and fast

4. Advanced Examples

4.1 Multi-Node DDP Training

#!/bin/bash
#SBATCH --job-name=ddp_multinode
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --nodes=4
#SBATCH --gres=gpu:4
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=32
#SBATCH --time=96:00:00

# Activate scratch
source .activate_scratch

# Setup master node
MASTER_ADDR=$(scontrol show hostnames "$SLURM_NODELIST" | head -n 1)
MASTER_PORT=29500

# One launcher task per node; each spawns 4 GPU workers, so split the CPUs among them
export OMP_NUM_THREADS=$((SLURM_CPUS_PER_TASK / 4))

echo "Training on $SLURM_NNODES nodes, $((SLURM_NNODES * 4)) GPUs total"
echo "Master: $MASTER_ADDR:$MASTER_PORT"

# Launch distributed training
srun python3 -m torch.distributed.run \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=4 \
    --rdzv_backend=c10d \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train_ddp.py

# Results from all nodes synced!

4.2 Hyperparameter Search (Job Array)

#!/bin/bash
#SBATCH --job-name=hparam_search
#SBATCH --output=logs/search_%A_%a.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --array=0-99%10
#SBATCH --time=12:00:00

# Activate scratch
source .activate_scratch

# Each array task gets a different seed and learning rate
SEED=$SLURM_ARRAY_TASK_ID
LR=$(python3 -c "print(1e-4 * (1.5 ** ($SEED % 10)))")  # cycle through 10 learning rates

python3 train.py \
    --seed $SEED \
    --lr "$LR" \
    --output results/seed_${SEED}/

# Each task's results synced separately!
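
To sweep two hyperparameters at once, a common pattern is to decode the array index into a grid. A sketch (the learning-rate and batch-size values are just placeholders):

    # Illustrative: map a 0-99 array index onto a 10x10 grid of (learning rate, batch size)
    LR_INDEX=$((SLURM_ARRAY_TASK_ID % 10))
    BS_INDEX=$((SLURM_ARRAY_TASK_ID / 10))
    LR=$(python3 -c "print(1e-4 * (2 ** $LR_INDEX))")
    BATCH_SIZE=$((16 * (BS_INDEX + 1)))

    python3 train.py \
        --lr "$LR" \
        --batch_size "$BATCH_SIZE" \
        --output "results/lr${LR_INDEX}_bs${BS_INDEX}/"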

4.3 Checkpoint Resume

#!/bin/bash
#SBATCH --job-name=resume_training
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
#SBATCH --time=48:00:00

# Activate scratch
source .activate_scratch

# Copy previous checkpoint from a finished job's results into scratch
mkdir -p checkpoints
if [ -f "$HOME/results_job_3500/checkpoints/best_model.pt" ]; then
    cp "$HOME/results_job_3500/checkpoints/best_model.pt" checkpoints/
    echo "Resumed from previous checkpoint"
fi

# Continue training
python3 train.py --resume checkpoints/best_model.pt

# New checkpoints automatically synced!
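
If you'd rather not hardcode the previous job ID, you can pick the most recent results directory automatically. A sketch:

    # Illustrative: resume from the newest ~/results_job_* directory that has a checkpoint
    PREV=$(ls -d "$HOME"/results_job_* 2>/dev/null | sort -V | tail -n 1)
    if [ -n "$PREV" ] && [ -f "$PREV/checkpoints/best_model.pt" ]; then
        mkdir -p checkpoints
        cp "$PREV/checkpoints/best_model.pt" checkpoints/
        echo "Resumed from $PREV"
    fi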

4.4 Custom Sync Strategy (Advanced)

#!/bin/bash
#SBATCH --job-name=custom_sync
#SBATCH --output=%x_%j.out
#SBATCH --partition=GPU
#SBATCH --gres=gpu:1
#SBATCH --time=24:00:00

# Activate scratch
source .activate_scratch

echo "Running in: $SCRATCH_DIR"

# Long training with intermediate syncs
python3 train.py &
TRAIN_PID=$!

# Sync critical checkpoints every hour (while training runs)
while kill -0 $TRAIN_PID 2>/dev/null; do
    sleep 3600

    # Manual sync of critical files
    if [ -f checkpoints/latest.pt ]; then
        rsync -a checkpoints/latest.pt "$HOME/backup_checkpoints/"
        echo "$(date): Synced intermediate checkpoint"
    fi
done

wait $TRAIN_PID

# Epilog still syncs everything at the end!

5. Performance Comparison

I/O Speed Tests

Operation                  Home (NFS)   Scratch (Lustre)   Speedup
Write 1GB checkpoint       8.3s         0.2s               41x faster
Read 10GB dataset          145s         3.1s               47x faster
Create 10k small files     287s         4.2s               68x faster
Random access (1M IOPS)    Fails        Works              ∞x

Real Training Example

BERT-Large fine-tuning on SKQuAD:

Metric             Home (NFS)    Scratch       Improvement
Checkpoint save    12s/step      0.3s/step     40x faster
Data loading       3.2s/batch    0.1s/batch    32x faster
Total epoch time   4h 23m        1h 12m        3.6x faster
GPU utilization    45%           94%           2x better

Performance Impact

NFS stalls during checkpoint saves; Lustre doesn't.
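
You can get a rough feel for the difference yourself with a simple write test from inside a job; absolute numbers depend on current load, so treat this as a sanity check rather than a benchmark:

    # Rough write-speed comparison (run inside a job after source .activate_scratch)
    dd if=/dev/zero of="$HOME/dd_test.bin" bs=1M count=1024 conv=fsync
    dd if=/dev/zero of="$SCRATCH_DIR/dd_test.bin" bs=1M count=1024 conv=fsync
    rm -f "$HOME/dd_test.bin" "$SCRATCH_DIR/dd_test.bin"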


6. Troubleshooting

6.1 Job Failed Immediately

Symptom

Job exits with "Permission denied" or "No such file"

Solution

Make sure you added source .activate_scratch:

    #!/bin/bash
    #SBATCH --partition=GPU

    # THIS IS REQUIRED!
    source .activate_scratch

    python3 train.py

6.2 Output Files Not Found After Job

Symptom

Can't find results after job completes

Solution

Check the results directory:

    ls -lh ~/results_job_XXXXX/

All outputs are synced here, not in the original submit directory!

6.3 Large Checkpoints Missing

Symptom

Some checkpoints didn't sync

Possible causes:

  • Job hit time limit before epilog completed
  • Disk quota exceeded
  • Checkpoint was too large (>100GB needs more time)

Solution

    # Check epilog log
    ssh root@gpu01 'tail -100 /var/log/slurm/prolog-epilog/epilog-job*XXXXX*.log'

    # Manually recover from scratch (if still exists)
    rsync -avP /lustre/scratch/$USER/job_XXXXX/ ~/recovered_results/

6.4 Job Slower Than Expected

Symptom

Training is slow despite using scratch

Diagnostics:

# Check if actually running in scratch
squeue -j $JOBID -o "%i %Z"  # WorkDir should be /lustre/scratch/...

# Check I/O wait
ssh gpu01 'iostat -x 1 5'

# Check if data is actually in scratch
ls -lh /lustre/scratch/$USER/job_$JOBID/data/

6.5 Monitoring Live Progress

View live output:

# From login node
tail -f ~/results_job_XXXXX/train_model_XXXXX.out

# Or from scratch (while running)
ssh gpu01 'tail -f /lustre/scratch/$USER/job_XXXXX/*.out'

Monitor prolog/epilog:

# Watch real-time sync progress
ssh root@gpu01 'tail -f /var/log/slurm/prolog-epilog/*.log | grep RSYNC'
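
If you can ssh to the node running your job (as the examples above assume), you can also watch GPU utilization directly:

    # Refresh GPU utilization and memory usage every 5 seconds
    ssh gpu01 'nvidia-smi --query-gpu=utilization.gpu,memory.used --format=csv -l 5'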


7. Slurm Basics (Cheat Sheet)

Essential Commands

# Submit job
sbatch job.sh

# Check queue
squeue -u $USER

# Job details
scontrol show job XXXXX

# Cancel job
scancel XXXXX

# Job history
sacct -j XXXXX --format=JobID,State,Elapsed,MaxRSS,ReqMem

# Efficiency report
seff XXXXX

Common SBATCH Directives

#SBATCH --job-name=my_job         # Job name
#SBATCH --output=%x_%j.out        # Output file (%x=name, %j=jobid)
#SBATCH --error=%x_%j.err         # Error file
#SBATCH --partition=GPU           # Queue: CPU or GPU
#SBATCH --gres=gpu:2              # Request 2 GPUs
#SBATCH --nodes=1                 # Number of nodes
#SBATCH --ntasks=1                # Number of processes
#SBATCH --cpus-per-task=8         # CPUs per process
#SBATCH --mem=64G                 # Memory
#SBATCH --time=24:00:00           # Time limit (HH:MM:SS)

Output Filename Patterns

Pattern   Meaning         Example
%x        Job name        train_model
%j        Job ID          3928
%A        Array job ID    4000
%a        Array task ID   5
%N        Node name       gpu01

Recommended Pattern

    #SBATCH --output=%x_%j.out

8. Environment Variables

Available in Jobs

$SLURM_JOB_ID              # Job ID
$SLURM_JOB_NAME            # Job name
$SLURM_SUBMIT_DIR          # Directory where sbatch was called
$SLURM_CPUS_PER_TASK       # CPUs requested
$SLURM_NTASKS              # Total tasks
$SLURM_NNODES              # Number of nodes
$SLURM_NODELIST            # List of nodes
$CUDA_VISIBLE_DEVICES      # Visible GPUs (set by Slurm)

After source .activate_scratch

$SCRATCH_DIR               # /lustre/scratch/$USER/job_$JOBID
$DATA_DIR                  # $SCRATCH_DIR/data
$TMPDIR                    # $SCRATCH_DIR/tmp
$RESULTS_DIR               # $SCRATCH_DIR/results

Use in Python

    import os

    scratch = os.environ['SCRATCH_DIR']
    checkpoint_dir = os.path.join(scratch, 'checkpoints')
    os.makedirs(checkpoint_dir, exist_ok=True)
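
In shell scripts, a small guard makes the failure mode obvious if the activation line was forgotten. A minimal sketch:

    # Abort early with a clear message if scratch was not activated
    : "${SCRATCH_DIR:?not set - did you forget 'source .activate_scratch'?}"
    mkdir -p "$RESULTS_DIR" "$TMPDIR"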

9. Best Practices

DO

Recommended Practices

  • Always use scratch for training - 40x faster I/O
  • Use %x_%j.out for output files - easier to track
  • Request appropriate resources - don't over-request
  • Set realistic time limits - helps scheduler
  • Test with short jobs first - debug before long runs
  • Monitor job progress - use tail -f on output
  • Keep code in Git - submit directory gets copied

DON'T

Avoid These Mistakes

  • Don't write large files to home - use scratch!
  • Don't request 8 GPUs if you use 1 - wastes resources
  • Don't use --nodelist in production - reduces flexibility
  • Don't forget source .activate_scratch - defeats the purpose
  • Don't expect results to persist in scratch - the epilog syncs them home and then cleans up
  • Don't manually clean scratch - epilog does it
  • Don't run interactive jobs 24/7 - use batch jobs

10. Migration Guide

If You Have Existing Jobs

Before (manual scratch):

#!/bin/bash
#SBATCH --partition=GPU

SCRATCH="/lustre/scratch/$USER/job_$SLURM_JOB_ID"
mkdir -p "$SCRATCH"
rsync -a "$SLURM_SUBMIT_DIR"/ "$SCRATCH"/
cd "$SCRATCH"

python3 train.py

rsync -a output/ "$SLURM_SUBMIT_DIR/output/"
rm -rf "$SCRATCH"

After (automatic):

#!/bin/bash
#SBATCH --partition=GPU

source .activate_scratch
python3 train.py

# That's it!

Changes Needed

  1. Add source .activate_scratch
  2. Remove manual mkdir, rsync, cd, cleanup
  3. Results will be in ~/results_job_XXXXX/ instead of submit dir
  4. Update any hardcoded paths if needed
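
For example, an output path that pointed back at the submit directory can simply switch to the variables exported by .activate_scratch:

    # Before: wrote results straight back to the (slow) submit directory
    python3 train.py --output "$SLURM_SUBMIT_DIR/output/"

    # After: write to scratch; the epilog syncs it to ~/results_job_$SLURM_JOB_ID/
    python3 train.py --output "$RESULTS_DIR/"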

11. FAQ

Do I need to change my Python code?

No! Your code runs in scratch automatically, and relative paths keep working because the whole submit directory is copied there.

What if my dataset is 5TB?

Don't copy it! Keep large datasets in /lustre/datasets/ and reference them directly.
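
For instance, you can symlink the shared copy into scratch instead of duplicating it (the dataset path below is just an example; rsync-based syncing normally copies symlinks as links, not their contents):

    # Link the shared dataset into scratch instead of copying terabytes (example path)
    mkdir -p "$DATA_DIR"
    ln -s /lustre/datasets/imagenet "$SCRATCH_DIR/data/imagenet"
    python3 train.py --data "$DATA_DIR/imagenet/"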

Can I submit from any directory?

Yes! Prolog copies from wherever you run sbatch.

What if I need specific files NOT to sync?

Create .rsyncignore (advanced) or exclude in epilog config.
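
If .rsyncignore is enabled for your account, it would typically hold rsync-style exclude patterns, one per line; the exact format is an assumption here, so confirm it with the admins:

    # .rsyncignore (illustrative contents - verify the exact format with the admins)
    tmp/
    *.tmp
    data/raw/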

How long does sync take?

~1-2 seconds per GB. A 10GB checkpoint = ~15 seconds.

Can I monitor sync progress?

Yes! ssh root@NODE 'tail -f /var/log/slurm/prolog-epilog/epilog*.log | grep RSYNC'

What if job is killed mid-training?

Epilog still runs! Results are synced even for failed jobs.

Can I disable automatic scratch?

Yes, just don't add source .activate_scratch. Job runs in submit directory.


12. Complete Working Example

Here's a complete, production-ready training script:

#!/bin/bash
################################################################################
# BERT-Large Fine-tuning on SKQuAD Dataset
# Expected runtime: ~2 hours on 2x H200 GPUs
# Results: ~/results_job_XXXXX/
################################################################################

#SBATCH --job-name=skquad_bert
#SBATCH --output=%x_%j.out
#SBATCH --error=%x_%j.err
#SBATCH --partition=GPU
#SBATCH --gres=gpu:2
#SBATCH --cpus-per-task=16
#SBATCH --mem=96G
#SBATCH --time=04:00:00

# Activate automatic scratch
source .activate_scratch

# Environment setup
export OMP_NUM_THREADS=$SLURM_CPUS_PER_TASK
export CUDA_LAUNCH_BLOCKING=0

# Print job info
echo "═══════════════════════════════════════════════════════════"
echo "Job ID: $SLURM_JOB_ID"
echo "Job Name: $SLURM_JOB_NAME"
echo "Node: $(hostname)"
echo "GPUs: $CUDA_VISIBLE_DEVICES"
echo "Scratch: $SCRATCH_DIR"
echo "═══════════════════════════════════════════════════════════"
echo

# Verify GPU access
nvidia-smi --query-gpu=name,memory.total --format=csv
echo

# Start training
python3 train_bert.py \
    --model_name bert-large-uncased \
    --dataset skquad \
    --output_dir checkpoints/ \
    --num_train_epochs 3 \
    --per_device_train_batch_size 16 \
    --learning_rate 3e-5 \
    --warmup_steps 500 \
    --save_steps 1000 \
    --logging_steps 100 \
    --fp16

echo
echo "═══════════════════════════════════════════════════════════"
echo "Training complete! Results syncing to ~/results_job_$SLURM_JOB_ID/"
echo "═══════════════════════════════════════════════════════════"

# Epilog automatically:
# 1. Syncs checkpoints/ → ~/results_job_XXXXX/checkpoints/
# 2. Syncs logs → ~/results_job_XXXXX/logs/
# 3. Creates job summary
# 4. Cleans up scratch

Submit the Job

    sbatch train_skquad.sh

Monitor Progress

    tail -f ~/results_job_XXXXX/skquad_bert_XXXXX.out

After Completion

    ls -lh ~/results_job_XXXXX/
    # checkpoints/
    # logs/
    # skquad_bert_XXXXX.out
    # job_XXXXX_summary.txt