SLURM Tips and Tricks

Canceling jobs

scancel 1234                           # cancel job 1234
scancel -u myusername                  # cancel all my jobs
scancel -u myusername --state=running  # cancel all my running jobs
scancel -u myusername --state=pending  # cancel all my pending jobs

Environment variables

By default, sbatch copies your current environment into the job, so you don't need to pass variables explicitly on the command line. Just set the variable in your shell before calling sbatch.
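For example (the variable and script names here are placeholders):

export DATA_DIR=$HOME/my_dataset   # set in your shell first
sbatch my_job.sh                   # the job inherits DATA_DIR

Inside my_job.sh you can then reference $DATA_DIR as usual.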

SLURM also sets helpful environment variables of its own that can be used anywhere in your scripts. You can find a comprehensive list of these environment variables in the official SLURM documentation.
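For instance, a job script might echo a few of the common ones (these names come from SLURM's standard set):

#!/bin/bash
#SBATCH --ntasks=4

echo "Job $SLURM_JOB_ID was submitted from $SLURM_SUBMIT_DIR"
echo "Running with $SLURM_NTASKS tasks"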

Advanced sbatch parameters

Switches

The switches parameter can be useful when requesting resources from SLURM. Given the current setup of the clusters, it essentially limits the number of chassis your job runs on. There are 16 nodes per chassis, and normally SLURM will allocate your nodes wherever there is availability, which can spread your job across many different chassis. Transferring data from one chassis to another can be fairly inefficient, depending on how much data is being moved. If your application has to transfer a lot of data between nodes while it runs, you might want to use this parameter:

--switches=<count>[@<max-time>]

where count defines the maximum number of switches desired for the job. You can also specify the maximum time that SLURM will wait for that number of switches to become available. For example:

--switches=2@5:30:45

will tell SLURM to wait 5 hours, 30 minutes, and 45 seconds for 2 switches to become available. The maximum possible wait time, which is also the default, is configured by FSL staff.
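As a sketch, a job that needs 32 nodes kept on as few chassis as possible might request (the node count and time limit here are arbitrary):

#SBATCH --nodes=32
#SBATCH --switches=2@2:00:00   # at most 2 chassis; wait up to 2 hours for them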

For more information see the official sbatch documentation.

Job arrays

Job arrays allow you to submit many jobs at once, as one cohesive group. This is useful if you want to run the same program(s) on different data sets. It is much easier on the scheduler and more convenient for you. Instead of submitting many jobs in a loop or maintaining multiple batch scripts, simply use the array syntax:

-a, --array=<indexes>

where indexes specifies which array index values to use. Multiple values may be specified as a comma-separated list and/or as a range of values with a "-" separator.
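For example, the following are all valid index specifications:

--array=0-9           # indexes 0 through 9
--array=1,3,5         # indexes 1, 3, and 5
--array=0-4,8,10-12   # ranges and individual indexes can be mixed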

By default SLURM will execute as many jobs within the array as there are resources available. You can limit this behavior by using the '%' operator. For example:

--array=0-15%4

will create a job array with size 16 and then limit the number of simultaneously running tasks from this job to 4. Currently, FSL staff have configured the maximum possible job array size to be 1000.

Using array environment variables

There are two useful environment variables that SLURM sets up when you use job arrays:

  • SLURM_ARRAY_JOB_ID, specifies the array's master job ID number.
  • SLURM_ARRAY_TASK_ID, specifies the job array index number.

For example, suppose you have two input scripts, input0.py and input1.py. Instead of using two separate submission scripts:

#!/bin/bash
#SBATCH --time=01:00:00       # walltime
#SBATCH --ntasks=1            # number of processor cores (i.e. tasks)
#SBATCH --nodes=1             # number of nodes
#SBATCH --mem-per-cpu=1024M   # memory per CPU core
#SBATCH -J "Job0"             # job name

module load python/3.4.2
python input0.py

exit 0

#!/bin/bash
#SBATCH --time=01:00:00       # walltime
#SBATCH --ntasks=1            # number of processor cores (i.e. tasks)
#SBATCH --nodes=1             # number of nodes
#SBATCH --mem-per-cpu=1024M   # memory per CPU core
#SBATCH -J "Job1"             # job name

module load python/3.4.2
python input1.py

exit 0

You can simply use one submission script:

#!/bin/bash
#SBATCH --time=01:00:00       # walltime
#SBATCH --ntasks=1            # number of processor cores (i.e. tasks)
#SBATCH --nodes=1             # number of nodes
#SBATCH --mem-per-cpu=1024M   # memory per CPU core
#SBATCH -J "MyArrayJob"       # job name
#SBATCH --array=0-1           # job array of size 2

module load python/3.4.2
python input${SLURM_ARRAY_TASK_ID}.py # this will expand to be: 'python input0.py' or 'python input1.py'

exit 0
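The same two values are also available as filename patterns in sbatch, which is handy for giving each task its own output file: %A expands to the master job ID and %a to the task index. For example (the file name is a placeholder):

#SBATCH -o myjob_%A_%a.out   # e.g. myjob_36_0.out and myjob_36_1.out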

Note

Each index in the job array gets its own job ID, but the entire array also has its own master job ID. For example, if the sbatch command responds Submitted batch job 36, then the environment variables will be set as follows:

SLURM_JOBID=36
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=0

SLURM_JOBID=37
SLURM_ARRAY_JOB_ID=36
SLURM_ARRAY_TASK_ID=1

Scancel command use

You can cancel an entire job array at once by specifying the job ID of the entire array. Alternatively, individual array IDs (indexes) may be specified for cancellation.

# Cancel array ID 1 to 3 from job array 20
$ scancel 20_[1-3]

# Cancel array ID 4 and 5 from job array 20
$ scancel 20_4 20_5

# Cancel all elements from job array 20
$ scancel 20

For more information, see the official job array documentation.