BYU

Office of Research Computing

Using sacct

sacct lets you examine your pending, running, and finished Slurm jobs in much more detail than the job statistics page.

Constraints

By default sacct shows information about pending, running, and recently completed jobs. If you want information about different jobs--older jobs, for example, or only jobs in a certain state--you can give flags to sacct to refine your search. A few of the most useful flags are documented here; see the man page for a comprehensive list.

If you plan on using sacct as input to another program, you may want to use the -p/--parsable or -P/--parsable2 flags so that the output is delimited with '|' characters rather than whitespace.

To filter out job steps (e.g. with job ID '1234.extern' and '1234.0'), use the -X/--allocations flag. Note that this will do away with some performance information about your jobs if you use a non-default format.
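As a minimal sketch of feeding sacct into another program, the function below tallies jobs per state from '|'-delimited output. The name count_states is hypothetical, and the sketch assumes the input has a header row containing a State column, as -P/--parsable2 output does.

```shell
# count_states (hypothetical helper): read 'sacct -P' output on stdin
# and print how many rows are in each state.
# Assumes the first line is the header and one column is named "State".
count_states() {
    awk -F'|' '
        NR == 1 {
            # Find the State column from the header row
            for (i = 1; i <= NF; i++) if ($i == "State") col = i
            next
        }
        col {
            # Keep only the first word, so "CANCELLED by 1234" counts as CANCELLED
            split($col, parts, " ")
            counts[parts[1]]++
        }
        END { for (s in counts) print s, counts[s] }
    ' | sort
}

# Usage on a system with Slurm:
#   sacct -X -P -o JobID,State | count_states
```

Looking the column up by name rather than by position keeps the helper working if the format passed to sacct changes.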

Date

Often you'll want information on jobs from a specific time window; use -S/--starttime and -E/--endtime to control this. The time window is inclusive--a job that ended at 12:01 AM on January 1st 2020 will be included if --starttime 2020-01-01 is specified.

To see jobs, without job steps, that were active at some point between 2:00 PM on April 17th 2020 and the end of April 25th:

sacct --starttime 2020-04-17T14:00:00 --endtime 2020-04-25T23:59:59 --allocations

If you want to specify dates without time constraints, the simplest time format is MMDD. To see jobs from between August 12th and August 18th:

sacct -S 0812 -E 0818

There are also some strings that can be used as start and end times. To see jobs from between midnight and right now:

sacct --starttime midnight --endtime now

You can use the date command in conjunction with sacct for more advanced string representation of dates. To see jobs from between 1 month ago and 2 weeks ago:

sacct -S $(date -d '1 month ago' +%D-%R) -E $(date -d '2 weeks ago' +%D-%R)

Job name and ID

If you know a specific job ID that you would like information about, you can use the -j/--jobs flag:

sacct --jobs 1234

A comma-separated list of jobs can be specified:

sacct -j 5678,5679,5701

You can also filter by job name (the -J/--job-name flag for sbatch and salloc) with the --name flag:

sacct --name my-test-job

--name also takes comma-separated lists:

sacct --name job-name-1,job-name-2,job-name-3

Job state

You can filter jobs by state with -s/--state, whether they have finished, are running, or have yet to start. To see all pending and running jobs:

sacct --state PENDING,RUNNING

For jobs that have already finished you will probably want to specify a date range. For example, to see all jobs that have timed out in the past week and before midnight today:

sacct --starttime $(date -d 'last week' +%D-%R) --endtime midnight --state TIMEOUT

To see jobs that completed, either normally or abnormally, since July 12th, without job steps:

sacct -S 0712 -s CD,F -X

Here are the most useful states that can be specified, in both long and short forms:

Short  Long           Explanation
CA     CANCELLED      Job was cancelled by the user or a sysadmin
CD     COMPLETED      Job finished normally, with exit code 0
F      FAILED         Job finished abnormally, with a non-zero exit code
OOM    OUT_OF_MEMORY  Job was killed for using too much memory
PD     PENDING        Job is waiting to start
R      RUNNING        Job is currently running
TO     TIMEOUT        Job was killed for exceeding its time limit

A full list is available on the man page.

Time limit

If you want to filter for jobs with a certain time limit, use the -k/--timelimit-min and -K/--timelimit-max flags. To show only jobs with time limit between 48 and 72 hours that ran between 4 and 8 weeks ago:

sacct -S $(date -d '8 weeks ago' +%D-%R) -E $(date -d '4 weeks ago' +%D-%R) -k 48:00:00 -K 72:00:00

If you know the exact time limit of the jobs you are looking for, set both min and max time limit to the same value. For jobs with a time limit of 30 minutes that ran in the last month:

sacct -S $(date -d 'last month' +%D-%R) --timelimit-min 30 --timelimit-max 30

Output format

By default sacct gives fairly basic information about a job: its ID and name, which partition it ran on or will run on, the associated Slurm account, how many CPUs it used or will use, its state, and its exit code. The -o/--format flag can be used to change this; use sacct -e to list the possible fields.

You can also set the environment variable SACCT_FORMAT to specify the format; this is useful if you want to override the default by putting e.g. export SACCT_FORMAT=field1,field2,field3,... in your ~/.bash_profile and/or ~/.bashrc. The -o/--format flag has higher precedence than the environment variable.
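For instance, a line like the following could go in ~/.bashrc. The particular field list is only an illustration, not a recommended default.

```shell
# Illustrative default format; substitute the fields you use most
export SACCT_FORMAT='JobID,JobName%-30,Partition,State,Elapsed,ExitCode'

# An explicit -o/--format still overrides it for a single invocation:
#   sacct -o JobID,State
```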

See "Job Accounting Fields" on the man page for a full explanation of what each field means.

Some of the most useful fields:

Field      Meaning
JobID      The job ID
JobName    Name of the job (specified with -J/--job-name)
ExitCode   The job's exit code
State      The job's state (see "Job state" above)
Partition  Partition the job ran in
QOS        Name of the QOS the job ran under

For example, to see the ID and name of all jobs from the past year along with their elapsed time and exit codes:

sacct -S $(date -d 'last year' +%D-%R) --format JobID,JobName,Elapsed,ExitCode

You can increase or decrease the number of characters allocated for a column by appending '%N' (right justified) or '%-N' (left justified) to the field; for example, if you have unusually long job names and would like the job name column to have 40 characters, left justified:

sacct -S 1012 -E 1014 -o JobID,JobName%-40,State

The most useful fields dealing with different aspects of jobs are as follows:

Time

Field      Meaning
Submit     When the job was submitted
Start      When the job started
End        When the job ended
TimeLimit  How much time the job was allocated
Elapsed    How much time the job used

CPU

Field      Meaning
NCPUs      Number of CPUs used by the job
NNodes     Number of nodes used by the job
UserCPU    User CPU time used by the job
SystemCPU  System CPU time used by the job
TotalCPU   Total CPU time used by the job; sum of UserCPU and SystemCPU
CPUTime    Elapsed*NCPUs (total CPU time a perfectly efficient job would use)

To find how efficiently the CPUs were used, divide TotalCPU by CPUTime.
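The division can be sketched in shell as follows. The function names to_seconds and cpu_efficiency are hypothetical, and the parsing assumes the durations look like D-HH:MM:SS, HH:MM:SS, or MM:SS.mmm; a two-part value is treated as minutes and seconds.

```shell
# to_seconds (hypothetical helper): convert a duration such as
# 1-02:03:04, 02:03:04 or 03:04.567 into whole seconds.
# Fractions of a second are truncated.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        days = 0
        # Optional leading "D-" day count
        if (split(t, d, "-") == 2) { days = d[1]; t = d[2] }
        n = split(t, f, ":")
        if (n == 3)      secs = f[1] * 3600 + f[2] * 60 + f[3]
        else if (n == 2) secs = f[1] * 60 + f[2]   # assumed to be MM:SS
        else             secs = f[1]
        printf "%d\n", days * 86400 + secs
    }'
}

# cpu_efficiency TOTALCPU CPUTIME: print TotalCPU / CPUTime as a percentage
cpu_efficiency() {
    total=$(to_seconds "$1")
    alloc=$(to_seconds "$2")
    awk -v a="$total" -v b="$alloc" 'BEGIN {
        if (b == 0) print "n/a"; else printf "%.1f%%\n", 100 * a / b
    }'
}

# Usage: pass the two values reported by sacct for a job, e.g.
#   cpu_efficiency 01:30:00 02:00:00
```

A job that kept every allocated CPU busy for its whole run would come out near 100%; a much lower figure suggests the job requested more CPUs than it could use.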

Memory

Field   Meaning
ReqMem  Amount of memory requested; suffixed with 'c' if per CPU, 'n' if per node
AveRSS  Average memory use for all tasks
MaxRSS  Maximum memory use of any task

AveRSS and MaxRSS will usually be the same since most jobs consist of just one task; this is the case for any of the Ave* and Max* fields.

sacct's memory usage measurement doesn't catch rapid memory spikes; if your job got killed for running out of memory, it did run out of memory even if sacct reports a lower memory usage than would trigger an OOM-kill.

I/O

Field         Meaning
AveDiskRead   Average number of bytes read for all tasks
MaxDiskRead   Maximum number of bytes read for any task
AveDiskWrite  Average number of bytes written for all tasks
MaxDiskWrite  Maximum number of bytes written for any task
AvePages      Average number of page faults for all tasks
MaxPages      Maximum number of page faults for any task

A job that reads or writes excessively will be bogged down significantly by I/O.

Examples

To get a general idea of how efficiently a job utilized its resources, the following format can be used:

JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode

Use this format to look at all jobs named 'julia-lang-ftw' that had a time limit of at least 2 days and were killed for running out of memory in the past 3 months:

sacct --name julia-lang-ftw --starttime $(date -d '3 months ago' +%D-%R) --state OUT_OF_MEMORY --timelimit-min 2-00:00:00 \
      --format JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode
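If you use this format often, one option is to wrap it in a small shell function so the field list does not have to be retyped. The names EFF_FORMAT and jobeff below are hypothetical; the function just passes any other arguments through to sacct.

```shell
# Hypothetical wrapper around the efficiency-oriented format shown above
EFF_FORMAT='JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode'

jobeff() {
    # "$@" forwards filters such as --name, --state or -S unchanged
    sacct --format "$EFF_FORMAT" "$@"
}

# Usage on a system with Slurm:
#   jobeff --name julia-lang-ftw --state OUT_OF_MEMORY -S 0712
```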