Hi everyone,
I'm writing to share an update on the UMBC HPCF and other Research Computing projects currently underway at DoIT.
Summary
- Critical Vulnerabilities Patched
- Users Can Configure Email Notifications from Slurm
- Coming Soon: “develop” Partition on chip
- Performance Improvements on RRStor
- ada Volumes Being Sunset by June 30
- Reminder: chip Maintenance Period on June 2nd
Critical Vulnerabilities Patched
In May, several serious vulnerabilities were disclosed that affected all major Linux distributions. In short, these flaws could allow a standard user to improperly gain full administrative control of a system. Our team quickly installed the necessary security patches and protective mitigations to prevent malicious users from exploiting these flaws.
Users Can Configure Email Notifications from Slurm
Users can now add sbatch/srun/salloc flags to their usual Slurm scripts or commands to have Slurm send notification emails to specified addresses. These notifications can be triggered by job status and other allocation events, e.g., when a job starts or ends, or when it reaches 90% of its time limit. Please see the “--mail-type” and “--mail-user” flags in the “man” pages for these Slurm commands for more information.
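As a rough illustration of the flags described above, here is a minimal batch script sketch. The job name, time limit, and email address are placeholders chosen for this example, not site defaults, and the workload is a stand-in; please check “man sbatch” for the full list of “--mail-type” values before relying on it.

```shell
#!/bin/bash
#SBATCH --job-name=mail-demo          # placeholder job name (assumption)
#SBATCH --time=00:10:00               # placeholder time limit (assumption)
# Send mail when the job begins, ends, fails, or reaches 90% of its time limit
#SBATCH --mail-type=BEGIN,END,FAIL,TIME_LIMIT_90
# Placeholder address: replace with your own before submitting
#SBATCH --mail-user=username@umbc.edu

# Stand-in workload: record where the job ran
job_host="$(uname -n)"
echo "Job running on ${job_host}"
```

If this were saved as mail-demo.sh (a hypothetical filename), it could be submitted with “sbatch mail-demo.sh”. The same two flags can also be passed directly on the sbatch, srun, or salloc command line.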
Coming Soon: “develop” Partition on chip
To better support the development and testing of Slurm jobs on chip, we will be adding a new “develop” partition. This dedicated partition is designed for users who need to test code interactively, so they can quickly check whether a change worked without waiting in the main job queues. Jobs running on this partition will have a short time limit (up to 90 minutes) to keep queue times low and maximize availability for rapid troubleshooting.
We aim to make this available following the chip maintenance period on June 2nd.
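Once the partition is live, requesting an interactive session on it might look something like the sketch below. The partition name and the 90-minute cap come from this announcement; the 30-minute request and single task are assumed values for illustration, and the final configuration may differ. The snippet only prints the command so it can be reviewed first.

```shell
#!/bin/bash
# Sketch only: the "develop" partition is not available yet.
dev_partition="develop"   # partition name from the announcement
dev_time="00:30:00"       # assumed request, under the stated 90-minute cap

# Assemble an interactive srun request for a single task
dev_cmd="srun --partition=${dev_partition} --time=${dev_time} --ntasks=1 --pty bash"

# Print the command for review instead of running it
echo "${dev_cmd}"
```

Running the printed command on a chip login node would request an interactive shell on a compute node in that partition, assuming the partition is configured as described.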
Performance Improvements on RRStor
This past month, the team has been fine-tuning the RRStor Ceph storage cluster behind the scenes to make it faster, more stable, and self-healing. In particular, we have automated how the system handles disconnected users to prevent file lockups, and improved our early-warning monitoring to catch hardware problems before they lead to widespread slowdowns.
For more information, check out this myUMBC post: Storage System Performance Update.
If you have noticed performance changes that may be related to storage, please submit a descriptive RT ticket, and DoIT Research Computing Staff will work with you to investigate and suggest ways to speed up your workflows.
ada Volumes Being Sunset by June 30
Given the age of the ada NFS server, the 170TB spread across 42 volumes will be migrated to RRStor in careful coordination with the volume owners. Please keep an eye out for emails from DoIT Research Computing Staff to kick off this process ahead of June 30.
Reminder: chip Maintenance Period on June 2nd
This is a reminder that on Tuesday, June 2nd, chip will be inaccessible during business hours (ET) for maintenance. Refer to this post for more information: 20260602 Downtime Announcement.
Publications
If you have any publications, presentations, theses, or other works that made use of the campus cluster, please submit an RT Ticket with bibliographic information so that we can accurately reflect this work in our records and on the HPCF Website.
Need Help?
As always, please communicate any issues/questions to the Research Computing RT Queue (hpcf.umbc.edu > User Support > Request Help).
--
Gregory Ballantine
HPC Specialist
UMBC DoIT