Getting Started with GPUs

GPUs

We have H200s, L40Ss, H100s, P100s, A100s, and V100s. The full list is maintained under the Documentation menu, at System Access -> Compute Resources. Entries marked "Preemption only" are owned by specific departments and faculty but are available to other users through preemption.

Using GPUs in Jobs

Using Slurm, you can request GPUs using the --gpus flag. Be sure to consider how many GPUs your program can utilize, and adjust your request accordingly; for example, if your program only uses one GPU and 6 processors, you might use --nodes=1 --ntasks=6 --mem=12G --gpus=1. Additionally, the environment variable CUDA_VISIBLE_DEVICES will list, as a comma-separated string, the CUDA devices available to your job. You can also specify particular GPU models, such as H200s or L40Ss, by using options like --gpus=h200:1 or --gpus=l40s:1 respectively, where the :1 denotes the number of GPUs you need.
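For example, a minimal batch script requesting one GPU might look like the following sketch; the time limit and program name are placeholders you would replace with your own.

    #!/bin/bash
    #SBATCH --nodes=1
    #SBATCH --ntasks=6
    #SBATCH --mem=12G
    #SBATCH --gpus=1                # or, e.g., --gpus=l40s:1 for a specific model
    #SBATCH --time=01:00:00         # placeholder time limit

    # Show which CUDA devices Slurm assigned to this job
    echo "CUDA_VISIBLE_DEVICES = $CUDA_VISIBLE_DEVICES"

    # Run your GPU program (placeholder name)
    ./my_gpu_program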

If you want access to more GPUs, running your jobs as preemptable gives you access to privately owned hardware. This opens up more GPU types and can often reduce time spent waiting in the queue. In exchange, your jobs may be terminated early, so they should use checkpointing. External checkpointing tools for GPUs are often cumbersome to use; most of our users find it easier to write checkpointing directly into their code. Many libraries, like PyTorch, have built-in checkpoint functionality and only require slight code modifications. A sketch of the job-script side of this pattern follows below.
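As a rough sketch, a preemptable job script can ask Slurm to requeue it and then resume from the most recent checkpoint when it restarts. The QOS value, script name, --resume flag, and checkpoint file below are placeholders for your own setup; check our documentation for the actual preemption options.

    #!/bin/bash
    #SBATCH --gpus=1
    #SBATCH --time=24:00:00
    #SBATCH --requeue                   # let Slurm requeue the job if it is preempted
    #SBATCH --qos=preemptable           # placeholder; see the docs for the real preemption settings

    # Resume from a checkpoint left by a previous (preempted) run, if one exists.
    # train.py and its --resume flag are placeholders for your own training code.
    if [ -f checkpoint.pt ]; then
        python train.py --resume checkpoint.pt
    else
        python train.py
    fi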

To compile or run CUDA code, you'll need the CUDA libraries and runtime; add them to your environment by running: module load cuda.
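For instance, compiling and running a simple CUDA program might look like this (the source file name is a placeholder, and the run step should happen inside a job that has a GPU allocated):

    module load cuda
    nvcc -o saxpy saxpy.cu      # saxpy.cu is a placeholder for your CUDA source file
    ./saxpy                     # run inside a job with a GPU allocated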

For interactive development or for compiling CUDA programs, you will need to request an interactive job using salloc. salloc accepts the same flags as sbatch, but you must pass them on the command line, since salloc isn't given a batch script to run (it gives you an interactive shell instead).
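For example, an interactive session with one GPU might be requested like this (the resource amounts and time limit are only illustrative):

    salloc --nodes=1 --ntasks=6 --mem=12G --gpus=1 --time=02:00:00
    # When the allocation is granted, salloc gives you an interactive shell
    # where you can compile and run your GPU code directly.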

Job Restrictions

Our current GPU restrictions can be found here. They vary depending on which cluster you're submitting to. These limits are subject to change, and administrators may impose additional restrictions as needed based on demand.

GPU Profiling

GPU profiling is essential for ensuring that computational resources are utilized efficiently and for diagnosing performance bottlenecks.

Nvidia-smi

One of the most fundamental tools available is nvidia-smi, a command-line utility provided by NVIDIA. It lets you query the number, model, and status of the GPUs available to your job on a node, which makes it a quick way to confirm that you successfully requested the GPUs you were after. It works in both sbatch and salloc jobs.
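A couple of common invocations, run inside an salloc session or at the top of an sbatch script:

    nvidia-smi                    # list the GPUs assigned to your job and their current status
    nvidia-smi --query-gpu=name,memory.total,utilization.gpu --format=csv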

Nvtop

For more dynamic, real-time monitoring, nvtop (NVIDIA top) offers an interactive, terminal-based interface that works much like top or htop. Run module load nvtop; nvtop to pull up the graphs. nvtop provides live metrics on GPU activity, displaying detailed information on running processes, memory consumption, power draw, and utilization percentages. By watching the GPUs' behavior as your job runs, you can identify resource contention, understand load distribution (are you really running on all 4 GPUs?), and adjust your workflow to improve overall efficiency.

While you have a job running on a compute node, you can ssh to that node. Run squeue --me on a login node to find the name of the node your job is on, then run ssh node_name for the duration of the job. It's often helpful to have an salloc session in one terminal and nvtop running in another for live debugging.
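Putting that workflow together, from a login node it looks roughly like this (node_name is whatever squeue reports for your job):

    squeue --me                  # find your job's node in the NODELIST column
    ssh node_name                # replace node_name with the node from squeue
    module load nvtop
    nvtop                        # live GPU utilization, memory, and power graphs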

Software Specific Tools

When working with deep learning frameworks like PyTorch, PyTorch's memory profiler and performance profiler become invaluable. These tools let you track memory usage and performance issues throughout your model's training or inference runs. They are best used in an sbatch job, since they require slight tweaks to your code and analysis over the full runtime.

If you come across another great profiler, please let us know. Note: we cannot make NVIDIA Nsight Systems or Nsight Compute available to our users due to security concerns.