To all chip users,
The DoIT Research Computing & Data (RCD) Team is continuing to work on the storage system performance issues that have been impacting the chip cluster since early February. For some users the issue is intermittent or not noticeable; for others it is persistent. When present, it shows up as unusually long file read, write, or load times, affecting everything from simple commands such as "ls" or "cd" when navigating the filesystem to the overall runtime of Slurm jobs.
While the RCD Team continues to optimize the RRStor Ceph File Storage System to meet the needs of its diverse user base, certain research workflows have been identified as especially vulnerable to poor performance. These are workflows that read or write many (more than one hundred thousand) small (tens of KB or less) files, or that read the directories containing them, on the storage system.
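For users who want a rough sense of whether one of their directories matches the many-small-files pattern described above, the following is a minimal Python sketch. It is an illustration only, not an RCD-provided tool: the function name count_small_files and the 64 KiB default cutoff are assumptions chosen for the example, since "tens of KB" is not an exact threshold.

```python
import os


def count_small_files(root, small_bytes=64 * 1024):
    """Return (small_count, total_count) for regular files under root.

    small_bytes is an assumed cutoff for a "small" file; adjust as needed.
    Symbolic links are not followed, and unreadable entries are skipped.
    """
    small = 0
    total = 0
    stack = [root]  # directories still to visit (iterative, no recursion)
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            total += 1
                            size = entry.stat(follow_symlinks=False).st_size
                            if size <= small_bytes:
                                small += 1
                    except OSError:
                        # Entry vanished or is not readable; skip it.
                        continue
        except OSError:
            # Directory is not readable; skip it.
            continue
    return small, total
```

Calling count_small_files on a project directory returns the number of small files and the total number of files; a small-file count above one hundred thousand suggests the workflow may be among those affected. Note that the scan itself reads every directory and checks every file's size, so it places the same kind of load on the storage system and is best run once rather than repeatedly.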
In light of this, the RCD Team has identified a number of research groups whose job runtimes have been adversely affected and is beginning to reach out to group PIs and group members to assess their cluster workflows.
Other types of standard cluster operations (e.g., software loads) are still generating long wait times, and the RCD Team will continue to investigate and propose solutions to these and similar problems.
We will continue to send weekly updates, and will share substantive news as it develops.
Roy Prouty
Assistant Director for Research Computing
UMBC DoIT