BYU

Office of Research Computing

Slurm

When you log in to the supercomputer via ssh or remote desktop, you'll find yourself on one of a few login nodes. Login nodes are meant for managing your data (moving, uploading/downloading, etc.), editing code, and other light tasks. The heavy computation, though, is meant to be done on one or more of our hundreds of compute nodes via Slurm. For this reason, tasks that use more than an hour of CPU time on the login nodes are killed (with a few exceptions, e.g. tar and rsync); for most long-running tasks, you should submit a job to be run on a compute node via Slurm.

Slurm's job is to fairly (by some definition of fair) and efficiently allocate compute resources. When you want to run a job, you tell Slurm how many resources (CPU cores, memory, etc.) you want and for how long; with this information, Slurm schedules your work along with that of other users. If your research group hasn't used many resources in the past few days and/or your resource request is modest, your job will likely start quickly; if your research group has been using the supercomputer heavily and/or you request a lot of resources, your job will probably wait in the queue for a while.

To run a job on a compute node, encapsulate your work in a shell script and use sbatch to submit it to Slurm.

Submitting Jobs

Batch jobs

Batch jobs are the primary means by which most users should interact with Slurm. A batch job can be as simple as a basic script:

#!/bin/bash
echo "Hello from $(hostname)"

...submitted by sbatch with a couple of flags:

sbatch --time 00:10:00 --mem 1G myscript.sh

For a list of the flags that sbatch takes, see our script generator and the man page; --mem (or --mem-per-cpu or --mem-per-gpu) and --time are required. The flags can be specified on the command line as above (highest precedence), as environment variables (medium precedence), or within the script (lowest precedence):

#!/bin/bash

# 4 cores on one node with 2 GiB memory total for one hour
#SBATCH --time 01:00:00
#SBATCH --mem 2G --ntasks 4 --nodes 1

./run_some_command
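As a rough illustration of the middle precedence level, the sketch below sets a default time limit through the environment. It assumes sbatch's SBATCH_TIMELIMIT input variable (the environment counterpart of --time); submit_quick is a hypothetical helper name, not something provided on the system.

```shell
# Sketch: give submissions a default 10-minute limit via the environment.
# SBATCH_TIMELIMIT overrides "#SBATCH --time" inside the script, and an
# explicit --time on the command line overrides SBATCH_TIMELIMIT.
submit_quick() {
    SBATCH_TIMELIMIT="00:10:00" sbatch "$@"
}

# Usage (myscript.sh is a placeholder):
#   submit_quick --mem 1G myscript.sh
#   submit_quick --mem 1G --time 00:30:00 myscript.sh   # command line wins
```

Setting the variable only for the single sbatch call keeps it from leaking into later submissions from the same shell.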

Launching tasks within jobs

srun can be used to launch processes within a job; these are referred to as "job steps," and can be thought of as sub-allocations--sbatch reserves resources, and srun starts tasks within the job that use a subset of those resources. By default, srun launches one copy of the command for each task in the job (one per CPU core when each task has a single core), but it can take most of the same arguments that sbatch can to specify which resources to use.

srun can also be used as a substitute for mpirun and mpiexec in most cases since most MPI implementations integrate with Slurm; if mpirun is giving you trouble, using srun in its place sometimes helps.

If this sounds useful, you may also be interested in GNU Parallel (module load parallel makes it available) and slurm-auto-array.

Interactive jobs

If you need to use a compute node interactively (e.g. to debug a GPU program), you can use salloc similarly to how you would use sbatch, but without a script name--for example, to get an interactive shell on a compute node with one GPU, four CPUs, and 8 GiB memory on one node for 2 hours, you could use:

salloc --time 2:00:00 --nodes 1 --ntasks 4 --gpus 1 --mem 8g

Most circumstances don't merit an interactive job; unless you are testing or debugging, stick to a job script to increase efficiency and reproducibility.

Graphical jobs

One exception to this rule is graphical jobs--when you need to use a lot of compute resources for visualization, e.g. to interact with a 3D representation of a protein. For such a job to work, you'll need to have a graphical session on the supercomputer, either via remote desktop (strongly recommended) or with ssh -X (slower than remote desktop). This done, start an interactive job with X11 forwarding enabled:

salloc --x11 ...

You can test that it's working by running xterm--it should launch a graphical terminal window.

Job arrays

Job arrays allow you to submit many jobs at once as a cohesive group. This makes it convenient to run the same set of commands on multiple data sets, and is easier on the scheduler. You can use raw job arrays for versatility, or use our custom job array automator for simplicity.

To submit a job array, add the --array=<indices> flag to your sbatch flags. <indices> is a comma-delimited list and/or dash-delimited range of numbers--for example, --array=1,3,5-6 would result in four array tasks numbered 1, 3, 5, and 6, while --array=1-8 would result in eight tasks numbered 1 through 8. You can also use a colon to specify a step size--to create tasks from 1 to 10 in steps of 3 (i.e. 1, 4, 7, and 10), specify --array=1-10:3. A percent sign limits the number of tasks from a given job array that run at once; to submit a job array with 15 tasks of which at most 5 run at a time, use --array=1-15%5.
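To make the index syntax concrete, here is a small sketch that expands an index specification into the task numbers it describes. It is an illustration of the rules in the paragraph above, not Slurm's own parser; expand_array_spec is a hypothetical helper, and the %<limit> suffix is simply dropped because it affects scheduling rather than which tasks exist.

```shell
# Illustration only: list the task IDs that an --array spec describes.
expand_array_spec() {
    local spec part range step first last
    spec=${1%%\%*}                      # drop a trailing %<limit>, if any
    for part in $(printf '%s' "$spec" | tr ',' ' '); do
        range=${part%%:*}               # "1-10" from "1-10:3"; "3" stays "3"
        step=1
        case $part in *:*) step=${part##*:} ;; esac
        first=${range%%-*}
        last=${range##*-}               # same as first when there is no dash
        seq "$first" "$step" "$last"
    done | xargs echo                   # join the IDs onto one line
}

# Usage:
#   expand_array_spec 1,3,5-6
#   expand_array_spec 1-10:3
```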

The indices you specify become the values of SLURM_ARRAY_TASK_ID, the environment variable by which each task can tell itself apart from the others. As an example, if you need to process 20 files named "to_process_$N.in", where N is each even number from 2 to 40, your job script might look something like this:

#!/bin/bash

#SBATCH --time 10 --mem 1G --array 2-40:2

infile="to_process_${SLURM_ARRAY_TASK_ID}.in"
outfile="processed_${SLURM_ARRAY_TASK_ID}.out"

process "$infile" "$outfile"

This is an idealized case--mapping between SLURM_ARRAY_TASK_ID and the work that each task must run is usually more abstract. Most of the time, it's easiest to use slurm-auto-array in such cases, but if you need fine-grained control and are pretty good at shell scripting, it's usually not too bad to set up a raw job array. As an example, to map SLURM_ARRAY_TASK_ID onto a unique input file of arbitrary name within the directory input_files, you could use something like:

# -type f keeps the directory itself (and any subdirectories) out of the list
infile="$(find input_files -type f | sort | sed -n "${SLURM_ARRAY_TASK_ID}p")"

...and submit the script with --array=1-<number-of-input-files>.
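Another mapping that often works is a manifest file with one line of parameters per task. The sketch below assumes such a file; nth_line, params.txt, and the two-column layout are hypothetical names chosen for illustration.

```shell
# Sketch: print line number $2 of the manifest file $1 (one task per line).
nth_line() {
    sed -n "${2}p" "$1"
}

# Inside an array job submitted with --array=1-<number-of-lines>, the line
# can be split into positional parameters (assumes no spaces within a field):
#   set -- $(nth_line params.txt "$SLURM_ARRAY_TASK_ID")
#   process "$1" "$2"
```

Keeping the parameters in a file also leaves a record of exactly which inputs each task index corresponded to.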

Job dependencies

Job dependencies allow you to submit jobs that will start only after other jobs start or complete. To submit a job that is dependent on job 1234, add the flag --dependency=<type>:1234 to your sbatch flags. The most commonly used "types" of dependencies are:

Dependency                     Description
after:JOBID[:JOBID...]         job may be started at any time after specified jobs have started execution
afterany:JOBID[:JOBID...]      job may be started at any time after all specified jobs have completed regardless of completion status
afterok:JOBID[:JOBID...]       job may be started at any time after all specified jobs have successfully completed
afternotok:JOBID[:JOBID...]    job may be started at any time after any specified jobs have completed unsuccessfully

For example, to submit jobs A and B, with job B depending on job A's successful completion, you could use:

jobA_id=$(sbatch --parsable jobA.sh) # `jobA_id` now contains a job ID
sbatch --dependency=afterok:$jobA_id jobB.sh # jobB won't start until jobA completes successfully

It is also possible to retrieve the job identifier from within a job using the environment variable SLURM_JOB_ID, which contains the job ID of the current job. This allows a Slurm job to submit other jobs that depend on itself.

Note that if you cancel a job with scancel, jobs that depend on its successful completion will also be cancelled.
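The two-job example above extends naturally to longer pipelines. The following is a minimal sketch, assuming each stage should run only if the previous one succeeded; submit_chain and the script names are hypothetical.

```shell
# Sketch: submit scripts as a chain; each job waits for the previous one
# to complete successfully (afterok).
submit_chain() {
    local prev_id="" script
    for script in "$@"; do
        if [ -z "$prev_id" ]; then
            prev_id=$(sbatch --parsable "$script") || return 1
        else
            prev_id=$(sbatch --parsable --dependency="afterok:${prev_id}" "$script") || return 1
        fi
        prev_id=${prev_id%%;*}   # --parsable may print "jobid;cluster"
        echo "${script}: job ${prev_id}"
    done
}

# Usage (script names are placeholders):
#   submit_chain preprocess.sh simulate.sh analyze.sh
```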

Scheduling Considerations

As a general rule, users should request only the hardware and features that they need. Each additional resource requested further constrains the scheduler, increasing the amount of time a job will wait in the queue.

Efficiency and throughput

Jobs that use fewer resources and/or complete faster allow you to get more work done, both by decreasing the resources required for each job and by keeping your scheduling priority high. Optimize both your software and your workflow so you can get more done faster. If you have questions about workflow optimization that aren't answered here, ask us.

Checkpointing

Some programs incorporate a mechanism that allows them to "checkpoint" their current state, exit, and resume work where they left off. The program's documentation will usually mention this if it is possible. For programs that don't implement checkpointing internally, you can use an external checkpointer like DMTCP (use module load dmtcp to make it available).

If you write your own software, implementing checkpointing is good practice, especially if the software will be used to run resource-intensive simulations. It allows you to be flexible with resource requests and to get higher throughput via preemptable jobs.
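As a minimal sketch of what application-level checkpointing can look like, the function below saves its progress after every unit of work and resumes from the saved value on the next run. The counter stands in for real work, the budget argument stands in for a job running out of walltime, and all names are hypothetical.

```shell
# Sketch: do up to $3 units of work toward a total of $2, saving progress
# in the state file $1 so a later run can pick up where this one stopped.
run_with_checkpoint() {
    local state_file="$1" total="$2" budget="$3" i=0 done_now=0
    # Resume from the last saved step if a checkpoint exists
    [ -f "$state_file" ] && i=$(cat "$state_file")
    while [ "$i" -lt "$total" ] && [ "$done_now" -lt "$budget" ]; do
        i=$((i + 1))                          # stand-in for one unit of work
        done_now=$((done_now + 1))
        echo "$i" > "${state_file}.tmp"       # write to a temporary file, then
        mv "${state_file}.tmp" "$state_file"  # rename, so the checkpoint is never partial
    done
    echo "$i"                                 # report how far we got
}

# Usage: three runs of at most 4 steps each finish a 10-step task
#   run_with_checkpoint state.txt 10 4
```

Writing to a temporary file and renaming it means that a job killed in the middle of a save still leaves the previous checkpoint intact.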

More resources

If your program scales well--meaning that increasing the number of CPU cores or GPUs it uses results in a commensurate decrease in runtime--using more resources for less time is a good way to speed up your jobs with minimal effort.

Better programming

If you write your own software, optimize it--if it's worth running on the supercomputer, it's worth optimizing. You should use an appropriate language like C++ or Julia, or at least let optimized and compiled libraries do the heavy lifting if you use a slow interpreted language like Python or R.

Node features

If you need a hardware feature that isn't universally available (e.g. AVX512), use -C <feature>. The available features have to do with CPU instruction sets (fma, avx, avx2, avx512), architecture (skylake, knl), networking (ib for Infiniband, opa for Omnipath), or GPUs (kepler for K80s, pascal for P100s). You can use -C 'a&b' to request both a and b, -C 'a|b' to request either a or b, and -C '!a' to request node(s) without a.

If you don't know what a feature is and how it will benefit your job, don't request it.

Software Licenses

Some software requires permission from a license server to run, and permission is only granted to a limited number of simultaneous instances. Slurm can ensure that your job doesn't start until the required licenses can be provided. To use 4 A licenses and 2 B licenses, use --licenses=A:4,B:2. To ask about available licenses or request that a new one be monitored, open a ticket.

Preemption

If you have work that can handle premature termination (via checkpointing), you can increase throughput by submitting preemptable jobs. Preemptable jobs have some limits lifted (number of jobs and processors per user, maximum runtime), can run on private hardware owned by other research groups, and are afforded higher scheduling priority in exchange for the possibility that they will be killed before their walltime is up. Regular jobs that Slurm can't fit elsewhere immediately take the place of preemptable jobs, so whether a preemptable job will run without interruption for a minute or a week depends heavily on the length of the queue.

Adding --qos=standby to your submission flags will make your job preemptable. If your job is set up to automatically handle unexpected termination and restarts, you can also add --requeue to your submission flags, which will cause the job to immediately resubmit (requesting the same resources) if it is killed by Slurm; only use --requeue if your job is specifically designed to bear automatic restarts.
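A job that uses --requeue needs some way to tell a fresh start from an automatic restart. One possible approach, sketched below, checks SLURM_RESTART_COUNT, which Slurm sets inside jobs that have been requeued or restarted; restart_mode is a hypothetical helper, and what "resume" means is up to your own checkpointing.

```shell
# Sketch: decide whether this run is a first start or a requeued restart.
restart_mode() {
    if [ "${SLURM_RESTART_COUNT:-0}" -gt 0 ]; then
        echo "resume"   # the job was requeued at least once
    else
        echo "fresh"    # variable unset or zero: first run
    fi
}

# Usage inside a job script (commands are placeholders):
#   if [ "$(restart_mode)" = "resume" ]; then ./run_some_command --resume
#   else ./run_some_command; fi
```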

Switches

If you run a job that will use multiple nodes and is latency-sensitive (e.g. an MPI program that involves frequent, small, blocking data exchanges between processes), you might want to limit the effective communication distance between processes. The --switches flag allows you to limit the number of switches between nodes in the job allocation (e.g. --switches=2). The more processors and fewer switches you request, the more constrained the scheduler will be and the longer your job will wait in the queue; to limit the amount of time your job waits for the requested switch limit, you can append @<time-limit> to the switch request (e.g. --switches=4@3:00:00 to limit the wait to 3 hours).

Quality of service (QOS)

QOS are used to distinguish between different sets of resources. For example, when users purchase private hardware, we set up a QOS that they can use to submit jobs to run on that hardware without affecting their priority on the normal QOS. QOS are also the means by which preemptable jobs are submitted.

Use --qos=<qos-name> to submit to a non-default QOS.

Test QOS

The test QOS is a QOS with stricter resource limits (one hour of walltime and 512 total processors) and shorter queue times, meant for developing and testing software and workflows. The resources used in this QOS count toward resource usage the same as the normal QOS.

Managing Jobs

Checking jobs

Use squeue to see all jobs in Slurm's queue; to restrict the (overwhelming) output to jobs from your user or your research group, use squeue -u $USER or squeue -A <your-professor's-netid>. Other useful flags are --long, which gives more detailed output, and --state=<state>, where <state> can be PENDING, RUNNING, or many others.
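When you have many jobs queued, a count per state is often easier to read than the full listing. The sketch below relies on squeue's -h (no header) and -o %T (print only the job state) options; job_state_counts is a hypothetical helper name.

```shell
# Sketch: count a user's jobs by state (defaults to the current user).
job_state_counts() {
    squeue -u "${1:-$USER}" -h -o %T | sort | uniq -c
}

# Usage:
#   job_state_counts            # your own jobs
#   job_state_counts <netid>    # someone else's jobs
```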

You can also use scontrol to see details about a particular job in a nice-to-view way. To display information about job 1234, use scontrol show job 1234.

sacct can give accounting information about pending, running and finished jobs in more detail than one's job statistics page. Unlike squeue and scontrol, it can display information about jobs from months or years past. This page goes into great detail on how to use sacct.
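As a small example of scripting against sacct, the sketch below prints a one-line summary of a job. It assumes the -X (allocation only, no steps), -n (no header), and -P (fields separated by "|") options and the JobID, State, and Elapsed fields; job_summary is a hypothetical helper name.

```shell
# Sketch: summarize a job's final state and elapsed time.
job_summary() {
    sacct -j "$1" -X -n -P --format=JobID,State,Elapsed |
        while IFS='|' read -r id state elapsed; do
            echo "job ${id}: ${state} after ${elapsed}"
        done
}

# Usage:
#   job_summary 1234
```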

Changing jobs

Holding and releasing

scontrol hold <jobid> "holds" a job; this means that the job is still known by the scheduler, but it won't start until you "release" it with scontrol release <jobid>. This can be useful if, for example, you forgot to copy over an input file and want to copy and modify it before the job runs. It's also possible to submit a job in a held state with sbatch --hold ....

Cancelling jobs

If you realize that a job isn't going to work or is no longer needed, use scancel <jobid> to cancel it. It's also possible to cancel all of your jobs (scancel -u $USER), or all of your jobs that are in a certain state (scancel -u $USER -t pending).

You can also use scancel to send arbitrary signals, which is convenient for things like manual checkpointing. For example, if you've set your job up to checkpoint when it receives SIGUSR1, you can use scancel --signal USR1 <jobid> to trigger a checkpoint.
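For a shell script, the receiving side of such a signal might look roughly like the sketch below, which saves a step counter whenever SIGUSR1 arrives. The file name, the counter, and do_work are placeholders. Whether the signal reaches the batch shell or only the job steps depends on the options given to scancel, so treat the delivery side as an assumption and check the scancel man page.

```shell
#!/bin/bash
# Sketch: write a checkpoint when SIGUSR1 is received.
checkpoint_file="${CHECKPOINT_FILE:-checkpoint.txt}"
step=0

save_checkpoint() {
    echo "$step" > "$checkpoint_file"
}
trap save_checkpoint USR1   # run save_checkpoint whenever SIGUSR1 arrives

do_work() {
    while [ "$step" -lt "$1" ]; do
        step=$((step + 1))
        sleep 1             # stand-in for one unit of real work
    done
}

# In a real script you would now call, for example: do_work 3600
```

The shell runs the handler only after the current foreground command returns, so keeping each unit of work short keeps checkpoints prompt.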

Faculty

Faculty can hold, cancel, and release their students' jobs, adjust resource limits and relative priority of their students, and grant private hardware access to anyone; see this page for details.