We are aware that slurm commands such as squeue and sacct are running slowly or even timing out this morning. We are investigating the issue. The slurm documentation suggests this can be caused by a large number of concurrent jobs combined with frequent queries to the slurm database.
Running jobs appear to be unaffected by this issue, though some pending jobs may need to be cancelled to reduce the load on slurm.
I will follow up by editing this post with the resolution.
I can confirm that the cause is a high number of jobs pending in the slurm queue (nearly 50K in total). This generated enough database load when queries were made that remote calls over the network (from any node to the management node) timed out before they could complete.
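For reference, on a cluster with the standard slurm client tools, the pending-job count can be gathered with `squeue -h -r -t PENDING | wc -l`, where -h drops the header line, -r expands job arrays into one line per task, and -t filters by job state. The sketch below uses a small saved sample of `squeue -h -o "%i %T"` output in place of live data, since it is only meant to illustrate the counting step:

```shell
# Saved sample of `squeue -h -o "%i %T"` output (job id, state) -- illustrative only.
printf '1001 RUNNING\n1002 PENDING\n1003 PENDING\n' > sample_queue.txt

# Count the jobs in the PENDING state; on a live system this is
# equivalent to `squeue -h -r -t PENDING | wc -l`.
grep -c 'PENDING$' sample_queue.txt
```

Pending jobs for a single user can then be cancelled administratively with `scancel -t PENDING -u <username>`.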
By increasing the network message timeout, I was finally able to gather information from the slurm database and administratively cancel a few tens of thousands of jobs. I am now working with slurm support to determine the best way to limit the total number of jobs in the queue. This will mean that future sbatch submissions will fail with an error when they would push the slurm queue above the limit. This is preferable to squeue being unavailable to all users due to high load.
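The two settings involved live in slurm.conf on the management node. Parameter names below are from the Slurm documentation; the values shown are illustrative, not the ones we deployed:

```
# slurm.conf (illustrative values, not our production settings)
MessageTimeout=60     # seconds RPCs to slurmctld may wait before timing out (default 10)
MaxJobCount=20000     # ceiling on jobs slurmctld will track; sbatch submissions
                      # beyond this limit are rejected with an error
```

With a MaxJobCount in place, a submission that would exceed the limit fails immediately at sbatch time rather than silently adding to the load on the controller.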
HPC System Administrator
Research Computing Specialist, UMBC DoIT
Office: (410) 455-6351