by gabdiax
101 days ago
That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach. In your experience, were there usually early signals in the metrics before job failures increased, for example latency changes, resource saturation, or network anomalies? I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.