Getting Started With GPUs
The m9g cluster has 40 GPU nodes, each with 4 NVIDIA P100 GPUs. Each node has 128 GB of memory and the same CPUs as the m9 cluster. The P100s are a significant improvement over the K80s of the m8g cluster, which in turn are a significant improvement over the previous GPUs.
The m8g cluster has 32 GPU nodes, each with 2 NVIDIA K80s. Each K80 reports as 2 GPUs, however, so querying a node will show 4 K80 GPUs available. Each node has 64 GB of memory and the same CPUs as the m8 cluster.
For more detailed information about the GPU hardware, see NVIDIA's website.
Using GPUs in Jobs
You can request GPUs in Slurm using the --gres feature. As an example, you can request a whole node and its 4 GPUs by including the following among your sbatch flags: --nodes=1 --mem=64G --exclusive --gres=gpu:4. Please be aware of how many GPUs your program can use and request accordingly. If your program can use only one GPU and 6 processors, use: --nodes=1 --ntasks=6 --mem=12G --gres=gpu:1. The environment variable CUDA_VISIBLE_DEVICES will contain a comma-separated list of the CUDA devices that your job has access to.
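As a minimal sketch, the single-GPU request described above can be written as a batch script. The #SBATCH lines carry the flags from the text. The count_gpus function is a hypothetical helper, not part of Slurm or CUDA, that reports how many devices CUDA_VISIBLE_DEVICES lists.

```shell
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=6
#SBATCH --mem=12G
#SBATCH --gres=gpu:1

# count_gpus is a hypothetical helper (not a Slurm or CUDA command).
# It prints the number of entries in a comma-separated device list.
count_gpus() {
    list="${1:-}"
    if [ -z "$list" ]; then
        echo 0
        return
    fi
    # Number of entries = number of commas + 1.
    commas=$(printf '%s' "$list" | tr -cd ',' | wc -c)
    echo $((commas + 1))
}

# Show which devices the job was given and how many there are.
echo "Visible CUDA devices: ${CUDA_VISIBLE_DEVICES:-none}"
echo "GPU count: $(count_gpus "${CUDA_VISIBLE_DEVICES:-}")"
```

Submitted with sbatch, this should report a GPU count of 1 for the request shown. With the whole-node flags it should report 4.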
To compile or run CUDA code, you will need the CUDA libraries and runtime. Add them to your path by running:
module load cuda
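One way to confirm that the module took effect is to check whether the nvcc compiler is on your path afterwards. The sketch below assumes the module command is available in the shell, as it normally is on the cluster. The check_nvcc function and the hello.cu file name are hypothetical and used only for illustration.

```shell
#!/bin/bash
# Load the CUDA toolchain if the module command exists in this shell;
# on a machine without environment modules this step is skipped.
if command -v module >/dev/null 2>&1; then
    module load cuda
fi

# check_nvcc is a hypothetical helper: it reports whether nvcc is on the PATH.
check_nvcc() {
    if command -v nvcc >/dev/null 2>&1; then
        echo "nvcc found at $(command -v nvcc)"
        return 0
    fi
    echo "nvcc not found; did 'module load cuda' succeed?"
    return 1
}

check_nvcc || true

# Example compile step, with hello.cu as a hypothetical source file:
#   nvcc -o hello hello.cu
```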
For interactive development or for compiling CUDA programs, you will need to request an interactive job using salloc. The salloc program accepts the same flags as sbatch, but you must provide them on the command line, since salloc isn't given a file to run (it gives you a shell instead).
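To illustrate passing the flags on the command line, the sketch below builds a salloc command from the same single-GPU flags used with sbatch and prints it rather than running it. The gpu_salloc_cmd function is a hypothetical helper written for this example. Only the flags themselves come from the text above.

```shell
#!/bin/bash
# gpu_salloc_cmd is a hypothetical helper: it prints a salloc command line
# built from the same flags used with sbatch, without running it.
# Arguments: number of GPUs, number of tasks, memory (for example 12G).
gpu_salloc_cmd() {
    gpus="${1:-1}"
    tasks="${2:-6}"
    mem="${3:-12G}"
    echo "salloc --nodes=1 --ntasks=${tasks} --mem=${mem} --gres=gpu:${gpus}"
}

# Print the command for one GPU and 6 tasks so it can be reviewed first.
gpu_salloc_cmd 1 6 12G
```

Running the printed command on a login node should place you in a shell inside the allocation, where you can load the cuda module and compile.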
Here are our current GPU restrictions:
- Walltime of 3 days
- No more than 26 GPU nodes total
- No more than 26 GPU jobs at any time
- Jobs that are made preemptable escape the limits above, with walltime of up to 7 days
These are subject to change, and administrators may impose additional restrictions as they see fit based on demand.