Due to high memory consumption on the taki login node, `taki-usr1`, we are rebooting the node to clear the associated errors.
When free memory on one of our machines drops below a certain threshold, the operating system starts to kill processes according to a set of standard heuristics. Losing those processes can leave the machine behaving unexpectedly, which in turn requires a reboot.
I'd like to take this opportunity to remind all taki users that the login node is shared by everyone on the cluster. This node should be used to access files, write and compile code, and submit jobs to the slurm scheduler. Tests of code or workflows should be run on the develop partition instead. More information on how to use these nodes is available on the HPCF website (https://hpcf.umbc.edu/cpu/how-to-run-programs-on-taki/).
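As a rough illustration of running a test on the develop partition rather than on the login node, a batch script might look like the sketch below. The job name, resource counts, time limit, and output file name are placeholders chosen for illustration, not taki-specific settings; the HPCF page linked above remains the reference for the actual partition limits and any required options.

```shell
#!/bin/bash
# Minimal test-job sketch. Every value below is a placeholder.
# Hypothetical job name:
#SBATCH --job-name=quick_test
# Partition recommended above for tests of code or workflows:
#SBATCH --partition=develop
#SBATCH --nodes=1
#SBATCH --ntasks=1
# Assumed short time limit; check the partition's real limit:
#SBATCH --time=00:05:00
# %j expands to the Slurm job ID:
#SBATCH --output=quick_test_%j.out

# Replace the lines below with the code or workflow under test.
msg="Test job running on $(hostname)"
echo "$msg"
```

Saved under a name such as `quick_test.slurm` (a hypothetical file name), the script would be submitted from the login node with `sbatch quick_test.slurm`, so that the test itself runs on a compute node.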
I am using `squeue` as an indicator of progress on this issue. `squeue` is now responding, but with an added message that "MessageTimeout is too high for effective fault-tolerance". This is expected: I increased the MessageTimeout configuration parameter to give the slurm database more time to respond. This seems to be working, but it is leaving us with an authentication issue.
Looking at the output of `squeue`, some users are running tens of thousands of similar jobs. I am cancelling these and will follow up with the users and PIs (where necessary). I will also discuss with our slurm support how to configure slurm to avoid this situation in the future.
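For anyone who wants to see how many jobs each user has in the queue, one possible approach is sketched below. It is not necessarily the exact command used here; the helper name `count_jobs_per_user` is made up for this example, and it only assumes that `squeue -h -o "%u"` prints one user name per job line with no header.

```shell
#!/bin/bash
# Count jobs per user, busiest first.
# Reads one user name per line on stdin and prints "<count> <user>" lines.
count_jobs_per_user() {
    sort | uniq -c | sort -rn | awk '{print $1, $2}'
}

# Only query the scheduler where the Slurm client tools are installed.
if command -v squeue >/dev/null 2>&1; then
    # -h suppresses the header; -o "%u" prints only the user name per job.
    squeue -h -o "%u" | count_jobs_per_user | head
fi
```

The top few lines of the output show which accounts hold the largest share of the queue.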
HPC System Administrator
Research Computing Specialist, UMBC DoIT
Office: (410) 455-6351