PERUN Scratch: User Guide
The scratch system automatically copies your input data from the submit directory to fast Lustre storage before your computation starts, and copies the results back to your home directory when the job finishes.
Why Scratch?
| Storage | Speed | Best for |
|---|---|---|
| Home (NFS) | 25–100 MB/s | long-term data storage |
| Lustre scratch | 500–1600 MB/s | computation (I/O-intensive jobs) |
Scratch is temporary — data is automatically deleted after 7 days.
How It Works (Step by Step)
sbatch job.sh
│
▼
[PROLOG] home/submit_dir/ ──copies──▶ /lustre/scratch/<user>/job_<JOBID>/
│ (runs in background, ~1600 MB/s)
▼
[JOB] your code runs on Lustre (fast I/O)
│
▼
[EPILOG] scratch/ ──only new/modified──▶ home/submit_dir/results_job_<JOBID>/
(original input files are NOT copied back)
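The flow above can be tried out locally with a small simulation. This is only a sketch under stated assumptions: the real prolog and epilog scripts are not shown in this guide, so the marker file, the job ID 12345, and the use of `find -newer` to detect new or modified files are illustrative stand-ins, not the cluster's actual mechanism.

```shell
#!/bin/bash
# Illustrative simulation only; runs entirely inside a temporary directory.
set -euo pipefail

demo=$(mktemp -d)
submit_dir="$demo/home/submit_dir"         # stands in for your submit directory
scratch_dir="$demo/scratch/job_12345"      # stands in for $SCRATCH_DIR (hypothetical job ID)
results_dir="$submit_dir/results_job_12345"
marker="$demo/copy_done.marker"            # hypothetical marker, not a real system file

mkdir -p "$submit_dir" "$scratch_dir"
echo "a,b" > "$submit_dir/data.csv"

# "Prolog": copy inputs to scratch (timestamps preserved), then record the time.
cp -a "$submit_dir/." "$scratch_dir/"
touch "$marker"
sleep 1   # ensure files written later are strictly newer than the marker

# "Job": writes a new file on scratch.
echo "weights" > "$scratch_dir/model.pt"

# "Epilog": copy back only files modified after the marker.
mkdir -p "$results_dir"
(
    cd "$scratch_dir"
    find . -type f -newer "$marker" -print0 |
        while IFS= read -r -d '' f; do
            mkdir -p "$results_dir/$(dirname "$f")"
            cp -p "$f" "$results_dir/$f"
        done
)

ls "$results_dir"
```

In this simulation only `model.pt` ends up in the results directory; the unchanged `data.csv` is left out, which mirrors the "only new/modified" behaviour described above.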
Minimal Example: How to Write a Job Script
The simplest approach — add #!/bin/bash -l as the first line:
#!/bin/bash -l
#SBATCH --job-name=my_job
#SBATCH --partition=cpu_short
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00
# You are automatically in $SCRATCH_DIR (thanks to -l)
# Use relative paths — everything runs on fast Lustre
python train.py --input data.csv --output model.pt
# Results are automatically copied to results_job_<JOBID>/
The Only Requirement
Add #!/bin/bash -l to the first line of your script. The system takes care of everything else.
GPU Job Example
#!/bin/bash -l
#SBATCH --job-name=gpu_train
#SBATCH --partition=gpu_short
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=04:00:00
echo "Working directory (scratch): $(pwd)"
echo "SCRATCH_DIR: $SCRATCH_DIR"
echo "GPU: $CUDA_VISIBLE_DEVICES"
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv
python -u train.py \
    --data dataset/ \
    --output model.pt \
    --checkpoint checkpoints/
# After the job finishes, results will be in:
# <submit_dir>/results_job_<JOBID>/
Environment Variables
These variables are automatically set in every job:
| Variable | Value | Use |
|---|---|---|
| $SCRATCH_DIR | /lustre/scratch/<user>/job_<JOBID> | main working directory |
| $TMPDIR | $SCRATCH_DIR/tmp | temporary files |
| $RESULTS_DIR | $SCRATCH_DIR/results | output files |
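As a rough illustration of how these variables might be used inside a job script, the sketch below keeps an intermediate file under `$TMPDIR` and writes the final output to `$RESULTS_DIR`. The fallback values and the file names are assumptions added so the snippet also runs outside a job; this guide does not say whether files under `$RESULTS_DIR` are handled differently from other new files during copy-back.

```shell
# Fallbacks are only for trying this outside a job; inside a job the
# variables are already set by the system.
scratch="${SCRATCH_DIR:-$(mktemp -d)}"
tmp_dir="${TMPDIR:-$scratch/tmp}"
results="${RESULTS_DIR:-$scratch/results}"
mkdir -p "$tmp_dir" "$results"

# Intermediate data goes under the temporary directory ...
work_file=$(mktemp "$tmp_dir/unsorted.XXXXXX")
printf '3\n1\n2\n' > "$work_file"

# ... and the final output goes under the results directory.
sort -n "$work_file" > "$results/sorted.txt"
rm -f "$work_file"
```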
Works Without -l Too
Even without #!/bin/bash -l, these variables are available. The only difference is that without -l, the system does not automatically change your working directory to scratch — you need to add cd $SCRATCH_DIR manually.
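A minimal sketch of a script that does not use `-l` and changes directory itself. The `:-$PWD` fallback is an addition for illustration: it keeps the script from failing if `SCRATCH_DIR` happens to be unset, for example when the script is run outside a job.

```shell
#!/bin/bash
#SBATCH --job-name=my_job
#SBATCH --time=02:00:00

# Without -l the working directory is not changed for you, so do it explicitly.
# If SCRATCH_DIR is unset, stay in the current directory instead of failing.
cd "${SCRATCH_DIR:-$PWD}"
echo "Working directory: $(pwd)"
```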
Monitoring Progress
While your job is running, you can monitor the copy progress:
# Copying data TO scratch (before computation)
tail -F <submit_dir>/.prolog_progress_<JOBID>.log
# Copying results BACK (after computation)
tail -F <submit_dir>/.epilog_progress_<JOBID>.log
Example output:
+==============================================================+
| COPYING DATA TO SCRATCH |
+==============================================================+
| Elapsed : 00:00:32 | cur:1593 MB/s avg:1593 MB/s |
| Files : 17 / 17 | Data: 450 MB / 450 MB |
| [##############################] 100% DONE |
+==============================================================+
| >> Job is now running on fast Lustre scratch |
| >> After finish: NEW/MODIFIED -> results_job_<JOBID>/ |
+==============================================================+
Where Are My Results?
After your job finishes, results appear next to your job script:
<your submit directory>/
├── job.sh ← your original script
├── data.csv ← original inputs (untouched)
└── results_job_49345/ ← results after the job ends
    ├── model.pt
    ├── training_log.txt
    └── checkpoints/
        └── epoch_50.pt
What Gets Copied Back?
Only new or modified files. Your original input files (data.csv, etc.) are not copied back — they remain in your home directory untouched, saving copy time.
Disabling Scratch (.no_scratch)
If your job does not need fast I/O, or your dataset is too large to copy, you can opt out by creating a file named .no_scratch in the submit directory.
The job will then run directly from NFS (slower). Remove the file to re-enable scratch.
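Based on the Special Files Reference, opting out amounts to creating an empty `.no_scratch` file in the submit directory. A minimal sketch follows; `SUBMIT_DIR` is a hypothetical variable used here only so the snippet can run anywhere, so point it at the directory you actually run `sbatch` from.

```shell
# Hypothetical location; replace with the directory you submit jobs from.
submit_dir="${SUBMIT_DIR:-$(mktemp -d)}"
cd "$submit_dir"

# Opt out: jobs submitted from this directory run directly from NFS.
touch .no_scratch
[ -e .no_scratch ] && echo "scratch disabled for $PWD"

# Opt back in: remove the file before submitting the next job.
rm -f .no_scratch
```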
What Happens When You Cancel (scancel)?
| Situation | Result |
|---|---|
| Cancel before copy finishes | Copy is stopped. If your job already created data on scratch (e.g. checkpoints), the system will attempt to save them to results_job_<JOBID>/ |
| Cancel after copy finishes | Epilog runs normally — results appear in results_job_<JOBID>/ |
ML Checkpoints on Cancel
If you are training a model and cancel the job, the system automatically saves any checkpoints created during computation. You will find them in results_job_<JOBID>/.
Common Issues and Solutions
Job is stuck in state R but nothing seems to be happening
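The job is likely waiting for the data copy to scratch to complete. Check the prolog progress log, .prolog_progress_<JOBID>.log, in your submit directory (see Monitoring Progress).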
Results are not in results_job_<JOBID>/
Check whether the job finished successfully, for example by looking up its final state with Slurm's sacct command.
If the job ended in the FAILED state, results are not copied back.
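One way to look up the final state is Slurm's accounting command `sacct`. This is a sketch, assuming accounting is enabled on the cluster; the job ID 49345 is just the example ID used elsewhere in this guide, and `JOB_ID` is a hypothetical variable for overriding it.

```shell
job_id="${JOB_ID:-49345}"   # example job ID; replace with your own

if command -v sacct >/dev/null 2>&1; then
    # State shows e.g. COMPLETED, FAILED, CANCELLED or TIMEOUT;
    # ExitCode is reported as "code:signal".
    sacct -j "$job_id" --format=JobID,JobName,State,ExitCode,Elapsed
else
    echo "sacct not found; run this on a cluster login node" >&2
fi
```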
I want to use output from a previous job as input for a new one
Files in results_job_<JOBID>/ are not automatically copied to scratch (they are treated as old results). If you need them as input, move them out of results_job_<JOBID>/ and into your submit directory before submitting the new job.
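A sketch of the move, run here against a throwaway directory so it is safe to try. The directory name `inputs_from_49345` and the file `model.pt` are illustrative, and the sketch assumes that anything in the submit directory outside `results_job_*` is copied to scratch as usual.

```shell
# Throwaway stand-in for a real submit directory that already holds old results.
submit_dir=$(mktemp -d)
mkdir -p "$submit_dir/results_job_49345"
echo "weights" > "$submit_dir/results_job_49345/model.pt"

cd "$submit_dir"

# Move the file out of the old results directory so the next job's
# copy to scratch picks it up as ordinary input.
mkdir -p inputs_from_49345
mv results_job_49345/model.pt inputs_from_49345/

ls inputs_from_49345
```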
Special Files Reference
| File | Purpose |
|---|---|
| .prolog_progress_<JOBID>.log | Live progress of copy to scratch |
| .epilog_progress_<JOBID>.log | Live progress of results copy back |
| .copy_ready | Internal marker: copy to scratch is complete |
| .no_scratch | Create this to disable scratch for this directory |
| .activate_scratch | Internal file; do not modify |
Scratch Directory Layout
/lustre/scratch/<username>/
└── job_<JOBID>/ ← $SCRATCH_DIR (your working directory)
    ├── tmp/ ← $TMPDIR (temporary files)
    ├── logs/ ← log files
    ├── output/ ← output files
    ├── results/ ← $RESULTS_DIR
    └── ... ← your copied input files
Automatic Cleanup
Scratch directories older than 7 days are automatically deleted. Do not rely on scratch for long-term storage — your results are safely saved to results_job_<JOBID>/ in your home directory after the job ends.
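To see which of your scratch directories are getting close to the 7-day limit, something like the following could be used. It is a sketch: `list_old_jobs` is a hypothetical helper, and it judges age by the directory's modification time, which may not be the timestamp the cleanup itself uses.

```shell
# list_old_jobs ROOT DAYS
# Print job_* directories directly under ROOT that match find's "-mtime +DAYS",
# i.e. whose last modification is at least DAYS+1 full days in the past.
list_old_jobs() {
    local root="$1" days="$2"
    [ -d "$root" ] || return 0   # nothing to list if the directory does not exist
    find "$root" -maxdepth 1 -type d -name 'job_*' -mtime +"$days" -print
}

# Directories last modified 6 or more days ago, i.e. nearing deletion.
list_old_jobs "/lustre/scratch/${USER:-$(id -un)}" 5
```

If anything listed there still matters, copy it to your home directory before the cleanup removes it.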