Hello all.
Thank you for your patience while we sorted through the issues with Taki. We believe the cluster is now back online and working normally. The issue was caused by bad connectivity on the InfiniBand network. We were able to track down and replace the bad line that was preventing the use of various nodes and storage.
If you notice any issues going forward, please let us know. We will be keeping a close eye on the cluster and jobs to ensure the problem does not crop up again.
Tim Champ
Manager of Unix Infrastructure
DoIT