The same storage system that gave us problems on Wednesday hung again Friday evening. This time we did not lose any nodes. We know what is causing the issue now and are working with the vendor on a software fix. Running jobs had time added to compensate for the hang.

Last Updated Friday, Jan 17 06:50 pm 2020

My MPI job appears to run but never produces output. Is it actually running?

There may be a software or hardware problem on one of the nodes that your job was assigned. If that is the case, please open a support ticket so that an FSL admin can mark the node offline.

One way to check whether your job started on every node it was assigned is to run the following command (assuming job ID 12345):

rjobstat 12345

The job may not have spawned correctly on every node, or some of its processes may be deadlocked. If you need help interpreting the output, please ask.
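
If you want to narrow things down yourself, one possible approach (a rough sketch, not an FSL-provided tool) is to have every MPI rank print its hostname at startup, collect those lines in a file, and compare them with the list of nodes the job was assigned. The function name missing_nodes and both file names below are hypothetical, and how you obtain the assigned-node list depends on the scheduler and is not shown here.

```shell
# Hypothetical helper: missing_nodes EXPECTED_FILE SEEN_FILE
# EXPECTED_FILE: one hostname per line, the nodes the job was assigned.
# SEEN_FILE:     one hostname per line, as printed by the ranks that started.
# Prints every hostname in EXPECTED_FILE that never appears in SEEN_FILE.
missing_nodes() {
    expected_sorted=$(mktemp)
    seen_sorted=$(mktemp)

    # comm requires sorted input; -u also drops duplicate hostnames
    # (several ranks usually run on the same node).
    sort -u "$1" > "$expected_sorted"
    sort -u "$2" > "$seen_sorted"

    # -23 suppresses lines unique to the second file and lines common
    # to both, leaving only hostnames unique to the expected list.
    comm -23 "$expected_sorted" "$seen_sorted"

    rm -f "$expected_sorted" "$seen_sorted"
}
```

For example, `missing_nodes assigned_nodes.txt seen_nodes.txt` prints nothing when every assigned node reported in. Any hostname it does print is a good one to mention in your support ticket.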