Using sacct
sacct lets you examine your pending, running, and finished Slurm jobs in much more detail than the job statistics page.
Constraints
By default sacct shows information about your jobs from the current day (since midnight), whether pending, running, or completed. If you want information about different jobs--older jobs, for example, or only jobs in a certain state--you can give flags to sacct to refine your search. A few of the most useful flags are documented here; see the man page for a comprehensive list.
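If you plan on using sacct output as input to another program, you may want to use the -p/--parsable or -P/--parsable2 flags so that the output is delimited with '|' characters rather than whitespace. The two differ only in that --parsable also ends each line with a '|', while --parsable2 does not.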
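As a minimal sketch of consuming '|'-delimited output, the snippet below counts jobs per state with awk. The sample rows are hypothetical stand-ins for what a call such as sacct -P -X -o JobID,JobName,State might print; on a cluster you would capture the real output instead.

```shell
# Hypothetical sample; on a cluster you might instead use:
#   sample=$(sacct -P -X -o JobID,JobName,State)
sample='JobID|JobName|State
1234|my-test-job|COMPLETED
1235|my-test-job|FAILED
1236|other-job|COMPLETED'

# Split on "|", skip the header row, and tally the third column (State).
state_counts=$(printf '%s\n' "$sample" |
    awk -F'|' 'NR > 1 { n[$3]++ } END { for (s in n) print s, n[s] }' |
    sort)

printf '%s\n' "$state_counts"
```

The header row is skipped with NR > 1; if you would rather not receive it at all, sacct also has a -n/--noheader flag.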
To filter out job steps (e.g. with job ID '1234.extern' and '1234.0'), use the -X/--allocations flag. Note that some performance fields, such as MaxRSS, are recorded per step, so this will do away with that information if you use a non-default format.
Date
Often you'll want information on jobs from a specific time window; use -S/--starttime and -E/--endtime to control this. The time window is inclusive--a job that ended at 12:01 AM on January 1st 2020 will be included if --starttime 2020-01-01 is specified.
To see jobs, without job steps, that were active at any point between 2:00 PM on April 17th 2020 and the end of April 25th:
sacct --starttime 2020-04-17T14:00:00 --endtime 2020-04-25T23:59:59 --allocations
If you want to specify dates without time constraints, the simplest time format is MMDD. To see jobs from between August 12th and 18th:
sacct -S 0812 -E 0818
There are also some strings that can be used as start and end times. To see jobs from between midnight and right now:
sacct --starttime midnight --endtime now
You can use the date command in conjunction with sacct for more advanced string representation of dates. To see jobs from between 1 month ago and 2 weeks ago:
sacct -S $(date -d '1 month ago' +%D-%R) -E $(date -d '2 weeks ago' +%D-%R)
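If you reuse the same window in several commands, it may help to compute the bounds once. The sketch below assumes GNU date (for the -d option) and uses the YYYY-MM-DDTHH:MM:SS form, which sacct also accepts and which includes the full year; it only prints the resulting command rather than running it.

```shell
# Assumes GNU date; -d is not available in every date implementation.
start=$(date -d '1 month ago' +%Y-%m-%dT%H:%M:%S)
end=$(date -d '2 weeks ago' +%Y-%m-%dT%H:%M:%S)

# Print the command that would be run with these bounds.
echo "sacct -S $start -E $end"
```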
Job name and ID
If you know a specific job ID that you would like information about, you can use the -j/--jobs flag:
sacct --jobs 1234
A comma-separated list of jobs can be specified:
sacct -j 5678,5679,5701
You can also filter by job name (as set with the -J/--job-name flag for sbatch and salloc) with the --name flag:
sacct --name my-test-job
--name also takes comma-separated lists:
sacct --name job-name-1,job-name-2,job-name-3
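When the job IDs or names come from another command or a file, they usually arrive one per line or separated by spaces. A small sketch of turning such a list into the comma-separated form these flags expect, using hypothetical job IDs:

```shell
# Hypothetical job IDs, e.g. collected from earlier sbatch output.
ids='5678 5679 5701'

# Put one ID per line, then join the lines with commas.
id_list=$(printf '%s\n' $ids | paste -sd, -)

# Print the command that would be run.
echo "sacct -j $id_list"
```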
Job state
You can filter jobs by their state with -s/--state, whether they have finished, are running, or have yet to start. To see all pending and running jobs:
sacct --state PENDING,RUNNING
For jobs that have already finished you will probably want to specify a date range. For example, to see all jobs that have timed out in the past week and before midnight today:
sacct --starttime $(date -d 'last week' +%D-%R) --endtime midnight --state TIMEOUT
To see jobs that completed, either normally or abnormally, since July 12th, without job steps:
sacct -S 0712 -s CD,F -X
Here are the most useful states that can be specified, in both long and short forms:
Short | Long | Explanation |
---|---|---|
CA | CANCELLED | Job was cancelled by the user or a sysadmin |
CD | COMPLETED | Job finished normally, with exit code 0 |
F | FAILED | Job finished abnormally, with a non-zero exit code |
OOM | OUT_OF_MEMORY | Job was killed for using too much memory |
PD | PENDING | Job is waiting to start |
R | RUNNING | Job is currently running |
TO | TIMEOUT | Job was killed for exceeding its time limit |
A full list is available on the man page.
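One detail to watch for when post-processing the State column: cancelled jobs are typically reported as "CANCELLED by" followed by a user ID rather than the bare state name. A minimal sketch that keeps only the first word of each state, using hypothetical sample rows in place of real sacct -P -X -o JobID,State output:

```shell
# Hypothetical sample rows in --parsable2 form.
sample='JobID|State
1234|COMPLETED
1235|CANCELLED by 1001
1236|TIMEOUT'

# Keep only the first word of the State column so that
# "CANCELLED by 1001" is treated as plain CANCELLED.
states=$(printf '%s\n' "$sample" |
    awk -F'|' 'NR > 1 { split($2, w, " "); print $1, w[1] }')

printf '%s\n' "$states"
```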
Time limit
If you want to filter for jobs with a certain time limit, use the -k/--timelimit-min and -K/--timelimit-max flags. To show only jobs with a time limit between 48 and 72 hours that ran between 4 and 8 weeks ago:
sacct -S $(date -d '8 weeks ago' +%D-%R) -E $(date -d '4 weeks ago' +%D-%R) -k 48:00:00 -K 72:00:00
If you know the exact time limit of the jobs you are looking for, set both min and max time limit to the same value. For jobs with a time limit of 30 minutes that ran in the last month:
sacct -S $(date -d 'last month' +%D-%R) --timelimit-min 30 --timelimit-max 30
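Slurm generally reads a two-part value such as 48:00 as minutes and seconds, so it can be safer to spell out longer limits in the days-hours:minutes:seconds form. The helper below is a hypothetical convenience function, not part of Slurm, that converts a whole number of hours into that form.

```shell
# hours_to_limit is a hypothetical helper: whole hours -> D-HH:00:00.
hours_to_limit() {
    printf '%d-%02d:00:00\n' "$(( $1 / 24 ))" "$(( $1 % 24 ))"
}

min_limit=$(hours_to_limit 48)
max_limit=$(hours_to_limit 72)

# Print the command that would be run with these limits.
echo "sacct -k $min_limit -K $max_limit"
```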
Output format
By default sacct gives fairly basic information about a job: its ID and name, which partition it ran on or will run on, the associated Slurm account, how many CPUs it used or will use, its state, and its exit code. The -o/--format flag can be used to change this; use sacct -e to list the possible fields.
You can also set the environment variable SACCT_FORMAT to specify the format; this is useful if you want to override the default by putting e.g. export SACCT_FORMAT=field1,field2,field3,... in your ~/.bash_profile and/or ~/.bashrc. The -o/--format flag has higher precedence than the environment variable.
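A possible snippet for a shell startup file is sketched below. It keeps any SACCT_FORMAT value that is already set and otherwise falls back to a default; the particular field list is only an illustration.

```shell
# Keep a value that is already set; otherwise use this default.
# The field list is an example, not a recommendation.
: "${SACCT_FORMAT:=JobID,JobName%-30,Elapsed,State,ExitCode}"
export SACCT_FORMAT

echo "SACCT_FORMAT=$SACCT_FORMAT"
```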
See "Job Accounting Fields" on the man page for a full explanation of what each field means.
Some of the most useful fields:
Field | Meaning |
---|---|
JobID | The job ID |
JobName | Name of the job (specified with -J/--job-name) |
ExitCode | The job's exit code |
State | The job's state (see "Job state" above) |
Partition | Partition the job ran in |
QOS | Name of the QOS the job ran under |
For example, to see the ID and name of all jobs from the past year along with their elapsed time and exit codes:
sacct -S $(date -d 'last year' +%D-%R) --format JobID,JobName,Elapsed,ExitCode
You can increase or decrease the number of characters allocated for a column by appending '%N' (right justified) or '%-N' (left justified) to the field; for example, if you have unusually long job names and would like the job name column to have 40 characters, left justified:
sacct -S 1012 -E 1014 -o JobID,JobName%-40,State
The most useful fields dealing with different aspects of jobs are as follows:
Time
Field | Meaning |
---|---|
Submit | When the job was submitted |
Start | When the job started |
End | When the job ended |
TimeLimit | How much time the job was allocated |
Elapsed | How much time the job used |
CPU
Field | Meaning |
---|---|
NCPUs | Number of CPUs used by the job |
NNodes | Number of nodes used by the job |
UserCPU | User CPU time used by the job |
SystemCPU | System CPU time used by the job |
TotalCPU | Total CPU time used by the job; sum of UserCPU and SystemCPU |
CPUTime | Elapsed * NCPUs (total CPU time a perfectly efficient job would use) |
To find how efficiently the CPUs were used, divide TotalCPU by CPUTime.
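The division can be sketched with awk. Both fields are time strings, so they are converted to seconds first. The sample rows are hypothetical, and the parser assumes the values look like [days-]hours:minutes:seconds or minutes:seconds (optionally with a fractional part), which is how these fields are commonly printed.

```shell
# Hypothetical sample of: sacct -P -X -o JobID,TotalCPU,CPUTime
sample='JobID|TotalCPU|CPUTime
1234|03:00:00|04:00:00
1235|30:00.000|01:00:00
1236|1-00:00:00|2-00:00:00'

efficiency=$(printf '%s\n' "$sample" | awk -F'|' '
# Convert [D-]HH:MM:SS or MM:SS[.mmm] to seconds.
function to_seconds(t,    d, n, p) {
    d = 0
    if (index(t, "-")) { split(t, p, "-"); d = p[1]; t = p[2] }
    n = split(t, p, ":")
    if (n == 3) return d * 86400 + p[1] * 3600 + p[2] * 60 + p[3]
    if (n == 2) return d * 86400 + p[1] * 60 + p[2]
    return d * 86400 + p[1]
}
# Skip the header; print TotalCPU / CPUTime as a percentage.
NR > 1 {
    cpu = to_seconds($3)
    if (cpu > 0) printf "%s %.1f%%\n", $1, 100 * to_seconds($2) / cpu
}')

printf '%s\n' "$efficiency"
```

In this made-up data, job 1234 used 3 of a possible 4 CPU-hours, so it is reported at 75.0%.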
Memory
Field | Meaning |
---|---|
ReqMem | Amount of memory requested; suffixed with 'c' if per CPU, 'n' if per node |
AveRSS | Average memory use for all tasks |
MaxRSS | Maximum memory use of any task |
AveRSS and MaxRSS will usually be the same since most jobs consist of just one task; this is the case for any of the Ave* and Max* fields.
sacct's memory usage measurement doesn't catch rapid memory spikes; if your job got killed for running out of memory, it did run out of memory even if sacct reports a lower memory usage than would trigger an OOM-kill.
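Memory values are usually printed with a unit suffix, which makes them awkward to compare directly. The sketch below converts hypothetical MaxRSS values to MiB; it assumes the suffixes K, M, G, and T are powers of 1024 and treats a value with no suffix as bytes, which may not hold for every Slurm configuration.

```shell
# Hypothetical sample of: sacct -P -o JobID,MaxRSS
sample='JobID|MaxRSS
1234.batch|2048K
1235.batch|1.5G
1236.0|512M'

max_rss_mib=$(printf '%s\n' "$sample" | awk -F'|' '
# Convert a value such as 2048K or 1.5G to MiB.
function to_mib(v,    unit, num) {
    unit = substr(v, length(v))
    num = substr(v, 1, length(v) - 1) + 0
    if (unit == "K") return num / 1024
    if (unit == "M") return num
    if (unit == "G") return num * 1024
    if (unit == "T") return num * 1024 * 1024
    return v / 1024 / 1024   # assumption: no suffix means bytes
}
# Skip the header and any row with an empty MaxRSS.
NR > 1 && $2 != "" { printf "%s %.0f\n", $1, to_mib($2) }')

printf '%s\n' "$max_rss_mib"
```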
I/O
Field | Meaning |
---|---|
AveDiskRead | Average number of bytes read for all tasks |
MaxDiskRead | Maximum number of bytes read for any task |
AveDiskWrite | Average number of bytes written for all tasks |
MaxDiskWrite | Maximum number of bytes written for any task |
AvePages | Average number of page faults for all tasks |
MaxPages | Maximum number of page faults for any task |
A job that reads or writes excessively will be bogged down significantly by I/O.
Examples
To get a general idea of how efficiently a job utilized its resources, the following format can be used:
JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode
Use this format to look at all jobs named 'julia-lang-ftw' that ran with a time limit of at least 2 days and were killed for running out of memory in the past 3 months:
sacct --name julia-lang-ftw --starttime $(date -d '3 months ago' +%D-%R) --state OUT_OF_MEMORY --timelimit-min 2-00:00:00 \
--format JobID,JobName,Elapsed,NCPUs,TotalCPU,CPUTime,ReqMem,MaxRSS,MaxDiskRead,MaxDiskWrite,State,ExitCode
Last changed on Thu Oct 12 10:43:43 2023