Automatic-job-requeue
Resubmit job when reaching timelimit
Slurm allows a job to be requeued upon preemption, but not when it runs into its time limit. Occasionally you want to resubmit such a job automatically, possibly with a different time limit, or continue the incomplete calculation. To do so, you need to trap the signal sent to the job when the time limit is reached and requeue the job inside a signal handler. A simple example is shown below.
#!/bin/bash -l
#SBATCH --job-name=test-restart
#SBATCH --output=test-restart.out
#SBATCH --time=0-00:03:00
#SBATCH --partition=maxcpu
unset LD_PRELOAD
# the sleep-loop at the end is running for max_iteration*30s
max_iteration=10
# only allow a single restart of the job.
max_restarts=1
# new partition and timelimit for 2nd and subsequent job runs
alt_partition=allcpu
alt_timelimit=0-01:00:00
# just gather some information about the job
scontext=$(scontrol show job $SLURM_JOB_ID)
restarts=$(echo "$scontext" | grep -o 'Restarts=.' | cut -d= -f2)
outfile=$(echo "$scontext" | grep 'StdOut=' | cut -d= -f2)
errfile=$(echo "$scontext" | grep 'StdErr=' | cut -d= -f2)
timelimit=$(echo "$scontext" | grep -o 'TimeLimit=.*' | awk '{print $1}' | cut -d= -f2)
# term handler
# the function is executed once the job gets the TERM signal
term_handler()
{
    echo "executing term_handler at $(date)"
    if [[ $restarts -lt $max_restarts ]]; then
        # copy the logfile. will be overwritten by the 2nd run
        cp -v $outfile $outfile.$restarts
        # requeue the job and put it on hold. It's not possible to change partition otherwise
        scontrol requeuehold $SLURM_JOB_ID
        # change timelimit and partition
        scontrol update JobID=$SLURM_JOB_ID TimeLimit=$alt_timelimit Partition=$alt_partition
        # release the job. It will wait in the queue for 2 minutes before the 2nd run can start
        scontrol release $SLURM_JOB_ID
    fi
}
# declare the function handling the TERM signal
trap 'term_handler' TERM
# print some job-information
cat <<EOF
SLURM_JOB_ID: $SLURM_JOB_ID
SLURM_JOB_NAME: $SLURM_JOB_NAME
SLURM_JOB_PARTITION: $SLURM_JOB_PARTITION
SLURM_SUBMIT_HOST: $SLURM_SUBMIT_HOST
TimeLimit: $timelimit
Restarts: $restarts
EOF
# the actual "calculation"
echo "starting calculation at $(date)"
i=0
while [[ $i -lt $max_iteration ]]; do
    sleep 30
    i=$(($i+1))
    echo "$i out of $max_iteration done at $(date)"
done
echo "all done at $(date)"
The above script will run twice (unless the calculation finishes before the time limit). The first run, shown below, completes roughly half of the iterations in the maxcpu partition before it runs into the timeout.
# output test-restart.out.0 of the first run:
# note: the job keeps the jobID!
SLURM_JOB_ID: 5744919
SLURM_JOB_NAME: test-restart
# first run on maxcpu partition for 3 minutes
SLURM_JOB_PARTITION: maxcpu
SLURM_SUBMIT_HOST: max-display001.desy.de
TimeLimit: 00:03:00
# job hasn't been restarted yet
Restarts: 0
starting calculation at Sun Oct 4 23:21:00 CEST 2020
1 out of 10 done at Sun Oct 4 23:21:30 CEST 2020
2 out of 10 done at Sun Oct 4 23:22:00 CEST 2020
3 out of 10 done at Sun Oct 4 23:22:30 CEST 2020
4 out of 10 done at Sun Oct 4 23:23:00 CEST 2020
5 out of 10 done at Sun Oct 4 23:23:30 CEST 2020
6 out of 10 done at Sun Oct 4 23:24:00 CEST 2020
# after 3 minutes plus a grace period of ~30 seconds the job receives a TERM signal
slurmstepd: error: *** JOB 5744919 ON max-wn050 CANCELLED AT 2020-10-04T23:24:26 DUE TO TIME LIMIT ***
Terminated
executing term_handler at Sun Oct 4 23:24:26 CEST 2020
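The TERM signal arrives only about 30 seconds before the job is killed (Slurm's KillWait grace period). If the handler needs more time, for example to copy large files, Slurm can be asked to deliver the signal earlier via sbatch's --signal option; a minimal sketch, assuming a lead time of 120 seconds is sufficient:
# send TERM to the batch shell 120 seconds before the timelimit is reached (lead time is an assumption)
#SBATCH --signal=B:TERM@120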
After the timeout the job script is executed a second time, this time in the allcpu partition. The time limit and partition specified in the job script are overridden by the scontrol commands in the signal handler.
# output test-restart.out of the second run:
# note: the job keeps the jobID!
SLURM_JOB_ID: 5744919
SLURM_JOB_NAME: test-restart
# second run on allcpu partition with changed timelimit
SLURM_JOB_PARTITION: allcpu
SLURM_SUBMIT_HOST: max-display001.desy.de
TimeLimit: 01:00:00
# job has restarted once
Restarts: 1
starting calculation at Sun Oct 4 23:26:58 CEST 2020
1 out of 10 done at Sun Oct 4 23:27:28 CEST 2020
2 out of 10 done at Sun Oct 4 23:27:58 CEST 2020
3 out of 10 done at Sun Oct 4 23:28:28 CEST 2020
4 out of 10 done at Sun Oct 4 23:28:58 CEST 2020
5 out of 10 done at Sun Oct 4 23:29:28 CEST 2020
6 out of 10 done at Sun Oct 4 23:29:58 CEST 2020
7 out of 10 done at Sun Oct 4 23:30:28 CEST 2020
8 out of 10 done at Sun Oct 4 23:30:58 CEST 2020
9 out of 10 done at Sun Oct 4 23:31:28 CEST 2020
10 out of 10 done at Sun Oct 4 23:31:58 CEST 2020
all done at Sun Oct 4 23:31:58 CEST 2020
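Note that the example above restarts the "calculation" from the beginning; the second run counts from 1 to 10 again. To actually continue an incomplete calculation, the progress has to be written to a file before the requeue and read back on restart. A minimal sketch of such a loop, where the checkpoint filename checkpoint.dat is an arbitrary choice:
# resume the loop from a checkpoint file if one exists
chkfile=checkpoint.dat
i=0
[[ -f $chkfile ]] && i=$(cat $chkfile)
while [[ $i -lt $max_iteration ]]; do
    sleep 30
    i=$(($i+1))
    echo "$i out of $max_iteration done at $(date)"
    # record progress so a requeued run picks up where this one stopped
    echo $i > $chkfile
done
# remove the checkpoint once the calculation has completed
rm -f $chkfile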