Hi everyone,
I'm writing with an update on the UMBC HPCF and other Research Computing projects currently undertaken by DoIT.
Summary
- Critical Vulnerability Patched
- Slurm bug and misuse of GPUs
- GPU machines currently down
- Reminder of gpu-general partition
- Slow Read/Write Files on RRStor
Critical Vulnerability Patched
In April, a critical vulnerability was announced that affects every Linux distribution since 2017. As a result, chip and other research machines underwent an emergency shutdown on April 27, 2026 so that we could patch them. All login nodes and compute nodes were rebooted, and running jobs were killed. If you have questions about how this may have affected you, please submit an RT Ticket.
Slurm bug and misuse of GPUs
Earlier this month, it came to our attention that users could gain access to more GPUs than they initially requested from Slurm by remotely accessing nodes running active Slurm allocations and manually changing environment variables. This resulted in jobs expecting more VRAM than was available, which then died with out-of-memory (OOM) errors. This bug has been fixed. As a reminder, users should abide by the constraints set by Slurm. Any effort to subvert Slurm constraints may result in a lock being placed on your account.
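For anyone unsure how many GPUs a job was actually given, here is a minimal Python sketch that reads the allocation instead of overriding it. It assumes Slurm exports CUDA_VISIBLE_DEVICES for GPU jobs on our nodes, which is common but depends on site configuration; the helper name allocated_gpu_ids is hypothetical.

```python
import os


def allocated_gpu_ids(env=None):
    """Return the GPU IDs listed in CUDA_VISIBLE_DEVICES as a list of strings.

    Assumption: Slurm sets CUDA_VISIBLE_DEVICES for the job. The value is
    only read here and never modified.
    """
    env = os.environ if env is None else env
    raw = env.get("CUDA_VISIBLE_DEVICES", "")
    # Split the comma-separated list and drop empty entries.
    return [part.strip() for part in raw.split(",") if part.strip()]


if __name__ == "__main__":
    ids = allocated_gpu_ids()
    print(f"GPUs visible to this job: {len(ids)} ({', '.join(ids) or 'none'})")
```

Sizing your workload from this list, rather than from a hard-coded GPU count, keeps the job within what Slurm granted.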
GPU machines currently down
We are aware that the following machines are currently down: g24-07, g20-03, g20-10, g20-13. Please bear with us as we attempt to make repairs to these machines.
Reminder of gpu-general partition
Recently, we’ve received multiple tickets asking why jobs were being preempted before the three-day limit. This is largely due to use of the “gpu” partition and increased interest in the 2024 GPUs. As a reminder, back in September we rolled out dedicated access partitions for GPU users. Jobs placed on any of these partitions may be preempted before the three-day grace period if a contributor wants access to that machine. If you wish to avoid this, you may use the “gpu-general” partition, which guarantees that your job will only be placed onto nodes that give you the three-day minimum runtime.
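As an illustration of selecting the partition, here is a minimal Python sketch that writes out a Slurm batch script targeting “gpu-general”. The helper name render_batch_script, the single-GPU request via --gres=gpu:1, the 3-00:00:00 time limit, and the train.py command are all assumptions for the example; please check the HPCF documentation for the options that apply to your account.

```python
def render_batch_script(command, partition="gpu-general", gpus=1,
                        time_limit="3-00:00:00"):
    """Return the text of a Slurm batch script that runs `command`.

    The defaults are illustrative assumptions, not cluster policy.
    """
    lines = [
        "#!/bin/bash",
        f"#SBATCH --partition={partition}",  # request the gpu-general partition
        f"#SBATCH --gres=gpu:{gpus}",        # number of GPUs requested
        f"#SBATCH --time={time_limit}",      # days-hours:minutes:seconds
        "",
        command,
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Print the script; redirect to a file and submit it with sbatch.
    print(render_batch_script("python train.py"), end="")
```

The same effect can be had by adding the --partition=gpu-general line to a batch script you already use.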
Slow Read/Write Files on RRStor
We are aware of the ongoing issues with the RRStor filesystem and are actively working with our vendor to resolve them. For more information, see this myUMBC post: “Storage System Performance Update.”
Publications
If you have any publications, presentations, theses, or other works that made use of the campus cluster, please submit an RT Ticket with bibliographic information so that we can accurately reflect this work in our records and on the HPCF Website.
Need Help?
As always, please communicate any issues/questions to the Research Computing RT Queue (hpcf.umbc.edu > User Support > Request Help).
As always, keep computin’!
Max Breitmeyer,
HPC System Administrator