Jobs Underutilizing Resources
You can charter a fleet of buses for your next road trip, but that doesn't mean the buses will actually be used, especially if you don't invite anyone to go with you. Similarly, you can be allocated hundreds of nodes and leave them idle, especially if your application isn't configured properly.
Why is this important?
When you over-request resources, you keep other users from using them even though you aren't using them yourself. Please request only the resources that you actually need.
Why does this happen?
Your program must be capable of using all the resources you request. The scheduling system does not magically make your program launch more than one thread or process. Some programs cannot launch multiple threads or processes at all because they were not designed to do so.
Some programs are capable of parallelism but it must be specified with a command line parameter such as "-n", "-c", "-p", etc. Other programs use environment variables such as OMP_NUM_THREADS. Some programs require configuration file changes and others must be launched using something like mpirun.
Sometimes you may forget that you told your application to launch 4 tasks but you requested 16 from the scheduler. Please double check that your request matches the program configuration.
Additionally, a multi-threaded program runs as a single process on a single node, so its threads cannot use cores on other nodes. Only programs that use MPI or something similar can run on and communicate between multiple nodes.
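As a minimal sketch of keeping the request and the program configuration in step, the batch script below asks for 16 cores on one node and passes the same number to a multi-threaded program through OMP_NUM_THREADS. It assumes the scheduler is Slurm (as the squeue and sacct commands on this page suggest) and that the program honors OMP_NUM_THREADS; my_threaded_program is a hypothetical name, not a real executable.

```shell
#!/bin/bash
#SBATCH --nodes=1            # threads cannot span nodes, so request a single node
#SBATCH --ntasks=1           # one process...
#SBATCH --cpus-per-task=16   # ...with 16 cores available to its threads

# Slurm sets SLURM_CPUS_PER_TASK when --cpus-per-task is given.
# Fall back to 1 so the script also behaves sensibly outside of a job.
export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK:-1}"
echo "Launching with OMP_NUM_THREADS=${OMP_NUM_THREADS}"

# Hypothetical program name; uncomment and replace with your own executable.
# ./my_threaded_program
```

Because the thread count is read from the scheduler's own variable, changing the --cpus-per-task line is enough to keep the two numbers matched. A program that needs a command line parameter or a configuration file change instead would have to receive the same value through that mechanism.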
Please contact us if you need assistance.
What if you just need those resources?
You may need those resources, but if you aren't using them, you aren't using them. If an administrator tells you that you aren't using the allocated resources, you either 1) don't need those resources in the first place or 2) won't get your work done like you hoped. Please fix your program to use the allocated resources or stop requesting resources that your program won't use.
How to check running jobs
Run the rjobstat command from a login node. Example: rjobstat 123456.
Example output of a job that doesn't utilize the allocated resources:
$ rjobstat 123456
     Node   |     Memory (GB)     |        CPUs
 Hostname     Alloc   Max   Cur     Alloc   Used   Eff%
   m7-4-4      16.0     0     0        16   0.00      0
  m7-4-10      16.0     0     0        16   0.00      0
  *m7-4-2      16.0   2.8   1.2        16   0.99      6
   m7-4-5      16.0     0     0        16   0.00      0
* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job
In the example above, the job requested 4 nodes with 16 cores each (CPUs:Alloc column) and 16 GB of memory per node (Memory:Alloc column). Not all jobs have homogeneous layouts so the allocated CPUs and memory amounts can be different per node.
Node m7-4-2 has a CPU "efficiency" of 6% (CPUs:Eff% column) and uses only about 0.99 of a CPU core (CPUs:Used column). This means that approximately one process is keeping one CPU core busy nearly 100% of the time. The job requested 16 cores on that node but is using only 1, for an "efficiency" of 6% (1/16 ≈ 0.06).
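The arithmetic behind that figure can be reproduced with a short calculation. This is a sketch that assumes Eff% is simply CPUs:Used divided by CPUs:Alloc, rounded to a whole percent; the variable names are illustrative.

```shell
alloc_cpus=16    # CPUs:Alloc for m7-4-2
used_cpus=0.99   # CPUs:Used for m7-4-2

# Efficiency = used cores / allocated cores, as a percentage
# rounded to a whole number.
eff=$(awk -v used="$used_cpus" -v alloc="$alloc_cpus" \
    'BEGIN { printf "%.0f", used / alloc * 100 }')
echo "CPU efficiency: ${eff}%"
```

With the values from the example, this prints "CPU efficiency: 6%".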
The other cores and nodes are completely unused.
m7-4-2 has used a maximum of 2.8 GB of memory at any point in time over the course of the job. It is currently using 1.2 GB. Since the other nodes have no tasks on them, they have no memory usage from your job.
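For jobs with many nodes, scanning the table by eye gets tedious. The sketch below picks out the nodes with no CPU usage from the example output above. It assumes the rjobstat output always has the column layout shown in that example; with a real job, the same awk filter could be fed from rjobstat directly instead of from the saved text.

```shell
# Sample rjobstat output copied from the example above.
rjobstat_output='Node | Memory (GB) | CPUs
Hostname Alloc Max Cur Alloc Used Eff%
m7-4-4 16.0 0 0 16 0.00 0
m7-4-10 16.0 0 0 16 0.00 0
*m7-4-2 16.0 2.8 1.2 16 0.99 6
m7-4-5 16.0 0 0 16 0.00 0'

# Data rows have 7 fields with a numeric CPUs:Alloc in field 5.
# Collect the hostname of every node whose CPUs:Used (field 6) is zero.
idle_nodes=$(printf '%s\n' "$rjobstat_output" | awk '
    NF == 7 && $5 ~ /^[0-9]+$/ && $6 + 0 == 0 {
        sub(/^\*/, "", $1)        # drop the batch-node marker
        printf "%s%s", sep, $1    # space-separated list on one line
        sep = " "
    }')
echo "Idle nodes: ${idle_nodes}"
```

For the example job this reports m7-4-4, m7-4-10, and m7-4-5 as idle, which matches the three rows with 0.00 in the CPUs:Used column.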
For further investigation, you can ssh into each node and run top -u $USERNAME, ps -ef, and other commands. Note that if you have multiple jobs on a node you may see processes from each of those jobs. squeue -u $USERNAME -w $HOSTNAME will show you all jobs you have on a particular node.
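The node-by-node check can be wrapped in a small helper. This is a sketch only: show_job_processes is a hypothetical function name, and it assumes that ssh to your job's nodes is permitted and that $USER holds your username.

```shell
# List your processes on every node of a job.
# Usage: show_job_processes JOBID
show_job_processes() {
    jobid=$1
    # %N prints the job's node list; -h suppresses the header line.
    nodelist=$(squeue -h -j "$jobid" -o %N)
    # scontrol expands a compressed node list to one hostname per line.
    for host in $(scontrol show hostnames "$nodelist"); do
        echo "== $host =="
        # -b -n 1 makes top print a single snapshot instead of
        # running interactively; head keeps the output short.
        ssh "$host" top -b -n 1 -u "$USER" | head -n 20
    done
}
```

Calling show_job_processes 123456 from a login node would then print one short top snapshot per node. Nodes that show none of your processes are the ones sitting idle.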
If your job uses all of its allocated memory, it will run very slowly, and CPU usage will likely drop as memory is swapped back and forth from disk. The job will be terminated if it exceeds its memory allocation by more than a small margin.
How to check past jobs
Some of the following commands may be useful:
sacct -o all -u $USERNAME | less -S
sacct -o all -j $JOBID | less -S
man sacct
These are a little hard to interpret but give useful information. A much easier way is to view your jobs on the Job Statistics page in My Account.
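If you prefer the command line, a narrower field list such as sacct -j $JOBID -o JobID,AllocCPUS,Elapsed,TotalCPU is easier to read than -o all. The sketch below shows one way to turn those three fields into a CPU efficiency figure. The values are illustrative only and do not come from a real job, and to_seconds is a hypothetical helper that handles only the [DD-]HH:MM:SS form (sacct may print short durations in other forms).

```shell
# Convert a [DD-]HH:MM:SS duration to seconds.
to_seconds() {
    echo "$1" | awk -F'[-:]' '{
        if (NF == 4) print (($1 * 24 + $2) * 60 + $3) * 60 + $4
        else         print ($1 * 60 + $2) * 60 + $3
    }'
}

# Illustrative values only (not from a real job): 16 CPUs allocated,
# 2 hours elapsed, 2 hours of total CPU time.
alloc_cpus=16
elapsed=$(to_seconds "02:00:00")
total_cpu=$(to_seconds "02:00:00")

# Efficiency = CPU time actually consumed / CPU time that was reserved.
past_eff=$(awk -v t="$total_cpu" -v e="$elapsed" -v n="$alloc_cpus" \
    'BEGIN { printf "%.0f", t / (e * n) * 100 }')
echo "CPU efficiency: ${past_eff}%"
```

A job with these numbers kept roughly one of its 16 cores busy, the same pattern as the running-job example above.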
How to fix it
Read the documentation for your program and contact us if you need assistance. Do not request more resources than you need.
Last changed on Fri Jul 22 08:34:04 2016