BYU

Office of Research Computing

Using MPI

Many applications use MPI to distribute work across many nodes. When applicable, we encourage our users to use MPI.

Running MPI binaries

Compiling a binary against the MPI libraries is not enough by itself. To distribute a job across multiple processors and/or nodes, the binary must be launched with either srun or mpirun, with srun being preferred.

If your binary is named 'foobar' and it is compiled correctly, either of these should work (assuming the correct MPI module is loaded for mpirun):

srun foobar
mpirun foobar
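
The same MPI environment should be used to compile and to run the program. As a rough sketch, a build might look something like this; the module names below are only placeholders, so check module avail to see what is actually installed on the system:

# Module names are examples only; load whichever compiler/MPI modules your site provides
module load gcc openmpi

# mpicc is the MPI compiler wrapper for C; use mpicxx or mpifort for C++ or Fortran
mpicc -O2 -o foobar foobar.c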

Requesting processors with MPI

When launched without extra arguments, both srun and mpirun will try to start one MPI rank per Slurm task requested. A request with --ntasks=4 will therefore result in 4 MPI ranks, each running as its own process.
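
A quick way to confirm how many tasks you are actually getting, and which nodes they land on, is to launch a trivial command with srun before running the real program:

# Prints one line per task; with --ntasks=4 you should see 4 hostnames
srun hostname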

Small MPI jobs often benefit from being confined to a single node, since communication within a node is faster than communication over the network. This can be accomplished by adding --nodes=1 to the sbatch flags, e.g.

#SBATCH --nodes=1 --ntasks=4
#SBATCH --mem-per-cpu=2G
#SBATCH --time=1:00:00 --hint=compute_bound

srun foobar
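
If this script is saved as, for example, job.sh (the name is arbitrary), submit it with:

sbatch job.sh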

Multithreading with MPI

Sometimes MPI programs also use OpenMP or another mechanism for multithreading. In these cases, use -c/--cpus-per-task to set how many CPUs, and therefore threads, each task should have. Whether it is better to use more threads per task or more tasks with fewer threads each is highly dependent on the program; in general, though, hybrid MPI + threading tends to perform better than pure MPI as the problem size increases.

Here is an example that launches 8 MPI tasks across 4 nodes with 14 threads per task on our m9 system, for a program named foobar:

#SBATCH --nodes=4 --ntasks=8 --ntasks-per-socket=1 --cpus-per-task=14
#SBATCH -p m9 --mem=128G
#SBATCH --time=1:00:00 --hint=compute_bound

# Needed if the program uses OpenMP
export OMP_NUM_THREADS=14

srun foobar

This setup is theoretically ideal: each m9 node has 2 CPU sockets with 14 cores per socket, so this launches 1 MPI task per socket and uses multithreading to take advantage of the 14 cores in each socket, for a total of 112 cores across the 4 nodes. Whether this is actually ideal heavily depends on the program.
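
To avoid repeating the thread count in two places, the value Slurm assigned can usually be reused inside the job script: when --cpus-per-task is set, Slurm exports SLURM_CPUS_PER_TASK in the job environment. A minimal sketch:

# Keep OMP_NUM_THREADS in sync with --cpus-per-task;
# SLURM_CPUS_PER_TASK is only set when --cpus-per-task was requested, so fall back to 1
export OMP_NUM_THREADS=${SLURM_CPUS_PER_TASK:-1}

srun foobar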