|
|
|
|
|
by zippyman55
101 days ago
|
|
My team was responsible for the system administration of a large scale HPC center. We seemed to get blamed, incorrectly, for a lot of sloppy user code. I implemented statistical process controls for job aborts, and reported the results as mean time to failure rates over the years. It was pretty cool, as I could respond with failure rates for each of several thousand different programs.
What did not work was changing the culture to get people to improve their code. But I was able to push back hard when my team was arbitrarily blamed for someone else’s bad code.
It was easy to show that a jobs failure rate was increasing and link it to a recent upgrade or change. But, I felt I was often just shining the flashlight at an issue and trying to encourage a responsible party to take ownership. |
|
In your experience, were there usually early signals in metrics before job failures increased? For example patterns like latency changes, resource saturation or network anomalies.
I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.