PERUN Scratch: User Guide

The scratch system automatically copies your data to fast Lustre storage before your job starts and, when the job finishes, copies new and modified files back to a results_job_<JOBID>/ folder in your submit directory.

Why Scratch?

Storage	Speed	Best for
Home (NFS)	25–100 MB/s	long-term data storage
Lustre scratch	500–1600 MB/s	computation (I/O-intensive jobs)

Scratch is temporary: directories older than 7 days are deleted automatically.

How It Works (Step by Step)

sbatch job.sh
    │
    ▼
[PROLOG]  home/submit_dir/ ──copies──▶ /lustre/scratch/<user>/job_<JOBID>/
    │              (runs in background, ~1600 MB/s)
    ▼
[JOB]     your code runs on Lustre (fast I/O)
    │
    ▼
[EPILOG]  scratch/ ──only new/modified──▶ home/submit_dir/results_job_<JOBID>/
              (original input files are NOT copied back)

Minimal Example: How to Write a Job Script

The simplest approach is to use #!/bin/bash -l as the first line of the script:

#!/bin/bash -l
#SBATCH --job-name=my_job
#SBATCH --partition=cpu_short
#SBATCH --nodes=1
#SBATCH --ntasks=4
#SBATCH --mem=16G
#SBATCH --time=02:00:00

# You are automatically in $SCRATCH_DIR (thanks to -l)
# Use relative paths — everything runs on fast Lustre

python train.py --input data.csv --output model.pt

# Results are automatically copied to results_job_<JOBID>/

The Only Requirement

Make #!/bin/bash -l the first line of your script. The -l flag starts bash as a login shell, and with it the job starts in $SCRATCH_DIR automatically. The system takes care of everything else.

GPU Job Example

#!/bin/bash -l
#SBATCH --job-name=gpu_train
#SBATCH --partition=gpu_short
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --gres=gpu:1
#SBATCH --mem=64G
#SBATCH --time=04:00:00

echo "Working directory (scratch): $(pwd)"
echo "SCRATCH_DIR: $SCRATCH_DIR"
echo "GPU: $CUDA_VISIBLE_DEVICES"

source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

python -u train.py \
    --data   dataset/         \
    --output model.pt         \
    --checkpoint checkpoints/

# After the job finishes, results will be in:
# <submit_dir>/results_job_<JOBID>/

Environment Variables

These variables are automatically set in every job:

Variable	Value	Use
`$SCRATCH_DIR`	`/lustre/scratch/<user>/job_<JOBID>`	main working directory
`$TMPDIR`	`$SCRATCH_DIR/tmp`	temporary files
`$RESULTS_DIR`	`$SCRATCH_DIR/results`	output files
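
As a small illustration of how these variables might be used inside a job script, the sketch below writes an intermediate file under $TMPDIR and a final output under $RESULTS_DIR. The fallback values and the file name summary.txt are assumptions added so the snippet also runs outside a job; they are not part of PERUN.

```shell
#!/bin/bash
# Sketch only: the fallbacks let this run outside a job. Inside a job the
# scheduler is expected to have set SCRATCH_DIR, TMPDIR and RESULTS_DIR.
SCRATCH_DIR="${SCRATCH_DIR:-$(mktemp -d)}"
TMPDIR="${TMPDIR:-$SCRATCH_DIR/tmp}"
RESULTS_DIR="${RESULTS_DIR:-$SCRATCH_DIR/results}"
export TMPDIR
mkdir -p "$TMPDIR" "$RESULTS_DIR"

# mktemp honours $TMPDIR, so intermediate files land in the job's tmp area
work_file="$(mktemp)"
printf 'intermediate data\n' > "$work_file"

# Final outputs go to $RESULTS_DIR (summary.txt is a hypothetical name)
wc -l < "$work_file" > "$RESULTS_DIR/summary.txt"
rm -f "$work_file"
```

Relying on the variables instead of hard-coded paths keeps the script valid for every job ID.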

Works Without -l Too

These variables are set even without #!/bin/bash -l. The only difference is that, without -l, the system does not change your working directory to scratch automatically, so you need to add cd "$SCRATCH_DIR" near the top of your script yourself.
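
For scripts that do not use -l, the manual switch to scratch could look like the sketch below. The helper name enter_scratch is hypothetical, and the warning behavior is a design choice of this example, not something the system does.

```shell
#!/bin/bash
# Sketch: switch to scratch explicitly when the script does not use "bash -l".
# enter_scratch is a hypothetical helper name, not something PERUN provides.
enter_scratch() {
    if [ -n "${SCRATCH_DIR:-}" ] && [ -d "$SCRATCH_DIR" ]; then
        cd "$SCRATCH_DIR" || return 1
    else
        echo "SCRATCH_DIR is not available; staying in $(pwd)" >&2
        return 1
    fi
}

enter_scratch || echo "warning: running outside scratch" >&2
echo "Working directory: $(pwd)"
```

Checking that the directory exists before calling cd avoids silently running an I/O-heavy job from NFS.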

Monitoring Progress

You can follow the copy progress in both directions, before the computation starts and after it ends:

# Copying data TO scratch (before computation)
tail -F <submit_dir>/.prolog_progress_<JOBID>.log

# Copying results BACK (after computation)
tail -F <submit_dir>/.epilog_progress_<JOBID>.log

Example output:

+==============================================================+
| COPYING DATA TO SCRATCH                                      |
+==============================================================+
| Elapsed : 00:00:32 | cur:1593 MB/s avg:1593 MB/s            |
| Files   : 17 / 17  | Data: 450 MB / 450 MB                  |
| [##############################] 100% DONE                   |
+==============================================================+
| >> Job is now running on fast Lustre scratch                 |
| >> After finish: NEW/MODIFIED -> results_job_<JOBID>/        |
+==============================================================+
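
If you only want the current percentage, for example in a wrapper script, a helper along these lines may be enough. It assumes the log format matches the example output above; copy_progress is a hypothetical name.

```shell
#!/bin/bash
# Sketch: print the most recent percentage found in a progress log.
# Assumes the log looks like the example output above; copy_progress is a
# hypothetical helper name.
copy_progress() {
    local log="$1"
    [ -r "$log" ] || { echo "no progress log at $log" >&2; return 1; }
    # Keep only tokens such as "42%" and report the last one written
    grep -oE '[0-9]+%' "$log" | tail -n 1
}

# Usage (placeholders as in the commands above):
#   copy_progress "<submit_dir>/.prolog_progress_<JOBID>.log"
```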

Where Are My Results?

After your job finishes, results appear next to your job script:

<your submit directory>/
├── job.sh                          ← your original script
├── data.csv                        ← original inputs (untouched)
└── results_job_49345/              ← results after the job ends
    ├── model.pt
    ├── training_log.txt
    └── checkpoints/
        └── epoch_50.pt
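
After several submissions from the same directory, a short helper can pick out the results folder with the highest job ID. This is a sketch based on the results_job_<JOBID>/ naming shown above; latest_results is a hypothetical name and sort -V assumes GNU coreutils.

```shell
#!/bin/bash
# Sketch: print the results directory of the highest-numbered job in a
# submit directory. latest_results is a hypothetical helper name.
latest_results() {
    local dir="${1:-.}"
    # sort -V compares the job IDs numerically (GNU coreutils)
    find "$dir" -maxdepth 1 -type d -name 'results_job_*' | sort -V | tail -n 1
}

# Usage: ls "$(latest_results .)"
```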

What Gets Copied Back?

Only new or modified files. Your original input files (data.csv, etc.) are not copied back — they remain in your home directory untouched, saving copy time.
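
This guide does not describe how the epilog detects new or modified files. As an illustration only, one common way to get this behavior is to compare file timestamps against a marker file created once the input copy has finished. The function copy_changed and its marker argument are hypothetical and are not the PERUN implementation.

```shell
#!/bin/bash
# Illustration only: NOT the PERUN epilog. It shows one common way to copy
# back only files changed after a marker file was created.
# copy_changed is a hypothetical name; pass absolute paths for all arguments.
copy_changed() {
    local src="$1" marker="$2" dest="$3"
    mkdir -p "$dest"
    (
        cd "$src" || exit 1
        # -newer selects files modified after the marker's timestamp
        find . -type f -newer "$marker" -print0 |
            while IFS= read -r -d '' f; do
                mkdir -p "$dest/$(dirname "$f")"
                cp -p "$f" "$dest/$f"
            done
    )
}
```

With this approach, inputs that were copied in before the marker was written are skipped, while files the job created or rewrote afterwards are copied.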

Disabling Scratch (`.no_scratch`)

If your job does not need fast I/O, or your dataset is too large to copy, you can opt out:

touch .no_scratch
sbatch job.sh

The job will run directly from NFS (slower). Remove the file to re-enable scratch:

rm .no_scratch
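
To help decide whether to opt out, a pre-submit check like the sketch below reports the size of the submit directory and whether .no_scratch is present. The helper names are hypothetical, and the size is only an upper-bound estimate because the exact set of files the prolog copies is not specified here.

```shell
#!/bin/bash
# Sketch: report how much data sits in a submit directory and whether
# scratch is enabled for it. All helper names are hypothetical.
submit_dir_size_mb() {
    du -sm "${1:-.}" | cut -f1
}

scratch_enabled() {
    [ ! -e "${1:-.}/.no_scratch" ]
}

presubmit_report() {
    local dir="${1:-.}"
    if scratch_enabled "$dir"; then
        echo "scratch enabled: up to $(submit_dir_size_mb "$dir") MB may be copied"
    else
        echo "scratch disabled by $dir/.no_scratch: job runs directly from NFS"
    fi
}

# Usage: presubmit_report . && sbatch job.sh
```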

What Happens When You Cancel (`scancel`)?

Situation	Result
Cancel before copy finishes	Copy is stopped. If your job already created data on scratch (e.g. checkpoints), the system will attempt to save them to `results_job_<JOBID>/`
Cancel after copy finishes	Epilog runs normally — results appear in `results_job_<JOBID>/`

ML Checkpoints on Cancel

If you are training a model and cancel the job, the system automatically saves any checkpoints created during computation. You will find them in results_job_<JOBID>/.

Common Issues and Solutions

Job is stuck in state R but nothing seems to be happening

The job is likely waiting for the data copy to scratch to complete. Check:

tail -F <submit_dir>/.prolog_progress_<JOBID>.log

Results are not in results_job_<JOBID>/

Check whether the job finished successfully:

sacct -j <JOBID> --format=JobID,State,ExitCode

If the State column shows FAILED, results are not copied back.
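
The check above can be scripted. The sketch below maps a Slurm state string to what this guide says about results: FAILED jobs get no copy, completed jobs do, and cancelled jobs get a best-effort copy. Treating every other state as "unclear" is an assumption of this example, and job_state and results_hint are hypothetical names.

```shell
#!/bin/bash
# Sketch: translate a Slurm job state into what this guide says about results.
# results_hint and job_state are hypothetical helper names.
results_hint() {
    case "$1" in
        COMPLETED)  echo "expected" ;;
        FAILED)     echo "not copied" ;;
        CANCELLED*) echo "attempted" ;;   # sacct may print "CANCELLED by <uid>"
        *)          echo "unclear" ;;     # assumption: guide does not cover these
    esac
}

# -X limits output to the allocation line, -n drops the header,
# -P prints parsable output without padding.
job_state() {
    sacct -j "$1" -X -n -P --format=State | head -n 1
}

# Usage: results_hint "$(job_state <JOBID>)"
```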

I want to use output from a previous job as input for a new one

Files in results_job_<JOBID>/ are not automatically copied to scratch (they are treated as old results). If you need them as input, copy them out of the results directory first:

mkdir -p inputs
cp results_job_49345/model.pt inputs/
sbatch job2.sh

Special Files Reference

File	Purpose
`.prolog_progress_<JOBID>.log`	Live progress of copy to scratch
`.epilog_progress_<JOBID>.log`	Live progress of results copy back
`.copy_ready`	Internal marker — copy to scratch is complete
`.no_scratch`	Create this to disable scratch for this directory
`.activate_scratch`	Internal file — do not modify

Scratch Directory Layout

/lustre/scratch/<username>/
└── job_<JOBID>/              ← $SCRATCH_DIR (your working directory)
    ├── tmp/                  ← $TMPDIR (temporary files)
    ├── logs/                 ← log files
    ├── output/               ← output files
    ├── results/              ← $RESULTS_DIR
    └── ...                   ← your copied input files

Automatic Cleanup

Scratch directories older than 7 days are automatically deleted. Do not rely on scratch for long-term storage — your results are safely saved to results_job_<JOBID>/ in your home directory after the job ends.
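
To see which of your scratch directories are approaching the 7-day limit, a listing along these lines may help. It is a sketch that assumes the cleanup is based on modification time, which this guide does not state; expiring_scratch is a hypothetical name and the 5-day default is arbitrary.

```shell
#!/bin/bash
# Sketch: list job directories on scratch not modified for more than a given
# number of days. That cleanup uses modification time is an assumption;
# expiring_scratch is a hypothetical helper name.
expiring_scratch() {
    local root="$1" days="${2:-5}"
    # -mtime +N counts whole 24-hour periods, so +5 means 6 or more days old
    find "$root" -mindepth 1 -maxdepth 1 -type d -name 'job_*' -mtime +"$days"
}

# Usage: expiring_scratch "/lustre/scratch/$USER" 5
```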