Grace Hopper
Hardware Specifications
There are two GH200 Grace Hopper nodes. The specifications for each node are:
- Processor Family: NVIDIA GH200 Grace Hopper Superchip
- Number of Processors: 1
- Processor Type: 72 core NVIDIA Grace ARM Neoverse V2
- GPU: 1 x NVIDIA H100
- Internal Interconnect: NVIDIA NVLink-C2C 900GB/s
- System Memory: 480 GB LPDDR5X
- GPU Memory: 96 GB HBM3
- Memory accessible from GPU: All 576 GB of GPU and system memory is accessible from the GPU
- InfiniBand: HDR200 (operating at 100 Gb/s due to upstream switches)
CPU Architecture: It's very different
These nodes use ARM, a different CPU architecture than the x86/x86_64 used by AMD and Intel. This means that code needs to be compiled for ARM. Scripts should generally still work.
This particular ARM architecture is usually labeled as aarch64 in Linux.
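Because binaries built for x86_64 will not run on these nodes, a script shared between node types can check the architecture before choosing what to run. The following is a minimal sketch, not an official recommendation: the function name arch_label and the per-architecture directory layout (build/arm, build/x86) are hypothetical and only illustrate the idea.

```shell
#!/bin/bash
# Map the machine name reported by the kernel to a short label.
# On the GH200 nodes, `uname -m` is expected to print "aarch64".
arch_label() {
    case "$1" in
        aarch64|arm64) echo "arm" ;;
        x86_64|amd64)  echo "x86" ;;
        *)             echo "unknown" ;;
    esac
}

# Hypothetical layout: one build directory per architecture.
label="$(arch_label "$(uname -m)")"
echo "Detected architecture: ${label}; would use binaries in build/${label}"
```

Keeping separate build directories per architecture avoids accidentally running an x86_64 executable on an ARM node, which fails with an "Exec format error".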
Batch jobs
Jobs must be submitted from a RHEL 9 node, such as rhel9ssh.rc.byu.edu.
To use these nodes, your job submissions must include:
- Either -C arm or --constraint=arm
- A GPU request using an argument like --gpus gh200:1
Walltime limit: 1 day, subject to change.
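Putting the requirements above together, a batch script might look like the sketch below. The constraint and GPU request come from this page; the one hour time request, the job name, and the commands in the script body are illustrative assumptions, so adjust them for your own work.

```shell
#!/bin/bash
# Run only on the ARM (Grace Hopper) nodes.
#SBATCH --constraint=arm
# Request the GH200's GPU.
#SBATCH --gpus=gh200:1
# Illustrative time request; it must not exceed the 1 day walltime limit.
#SBATCH --time=01:00:00
# Hypothetical job name.
#SBATCH --job-name=gh200-test

# Record where the job landed; on a GH200 node this should report aarch64.
node_arch="$(uname -m)"
echo "Job ${SLURM_JOB_ID:-unknown} running on $(uname -n) (${node_arch})"

# Show the GPU if the NVIDIA tools are on the PATH.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi || echo "nvidia-smi is present but could not query a GPU"
else
    echo "nvidia-smi not found"
fi
```

If the script is saved as gh200-test.sh (a hypothetical filename), it can be submitted from a RHEL 9 node with sbatch gh200-test.sh.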
Operating System
The OS will be our Red Hat Enterprise Linux 9 image.
Support
These nodes will have minimal software support. Maintaining our existing software image and applications takes a lot of work, and we do not anticipate putting that same level of effort into ARM systems like these unless we make a larger investment in ARM products. That is not to say we won't support them, just that you can expect much less time and effort from staff on these specialty nodes. Ask us if you have questions, but be aware that we may not be able to dedicate the effort to help with particular applications.
Many users interested in these nodes are likely using libraries like PyTorch that should just work, or may just need minor work. Libraries like PyTorch and other very common applications are where we will focus our support effort. As of June 20, 2024, there is almost no software installed, not even PyTorch. We are beginning work on installing some software.
Code Compilation
In the future, we will add information here about recommendations for which compilers and flags to use.
GH200 vs H100 vs H200
NVIDIA's naming system is confusing. "Grace" (the "G" in "GH200") refers to the CPU generation. "Hopper" (the "H" in "GH200") is the GPU generation. The numbers tacked onto the end appear to be specific to the combination of CPU and GPU. In this case, the GH200 contains an H100 GPU along with a 72-core Grace CPU, which is an NVIDIA ARM CPU. The GH200 does not contain an H200 GPU, which is a different product. Bigger numbers do not always mean better things: the H800, for example, is a reduced-capability variant of the H100.
Last changed on Mon Jun 24 09:18:27 2024