Bunya HPC on UQ Status: https://status.its.uq.edu.au/affected/bunya-hpc/
[Resolved] Scratch filesystem on Bunya unavailable - job failures
https://status.its.uq.edu.au/issues/19215/
Mon, 04 Sep 2023 00:25:00 +0000 (resolved 2023-09-04 14:38:00)
<p>RCC are aware of an issue with the high-performance storage subsystem that provides access to /home, /scratch, /sw and $TMPDIR.</p>
<p>Clients may experience:</p>
<ul>
<li>very slow responses to commands;</li>
<li>stale file handle error messages;</li>
<li>job failures in certain circumstances.</li>
</ul>
<p>RCC staff are working with the backline vendor to resolve the issue as soon as possible and return all nodes to regular operation.</p>
<p>Once normal operations resume, clients should check the status of their jobs and resubmit them if required.</p>
<hr>
<p><strong>Update</strong> Job queues on Bunya have been <em>re-enabled</em>, and a workaround is now in place to mitigate the bug that triggered the failures over the weekend.</p>
<p>A further update will be provided once the vendor supplies a full patch. <span class="faded">(11:57 AEST, Sep 4)</span></p>
<p><strong>Resolution</strong> A vendor-supplied patch has been applied to the storage service. Bunya is now stable and jobs are running. <span class="faded">(14:38 AEST, Sep 4)</span></p>