| HN Mirror



	by zippyman55 101 days ago
	My team was responsible for system administration at a large-scale HPC center. We seemed to get blamed, incorrectly, for a lot of sloppy user code. I implemented statistical process control for job aborts and reported the results as mean-time-to-failure rates over the years. It was pretty cool, as I could respond with failure rates for each of several thousand different programs. What did not work was changing the culture to get people to improve their code. But I was able to push back hard when my team was arbitrarily blamed for someone else’s bad code. It was easy to show that a job’s failure rate was increasing and link it to a recent upgrade or change. But I felt I was often just shining the flashlight at an issue and trying to encourage a responsible party to take ownership.
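As a rough illustration of the per-program failure rates described above, here is a minimal Python sketch. It is not the author's implementation (which is not shown in the thread); the record layout `(program_name, aborted)` and the function name `mean_days_between_aborts` are assumptions made up for this example.

```python
from collections import defaultdict


def mean_days_between_aborts(job_records, observed_days):
    """Estimate mean time between aborts, in days, for each program.

    job_records: iterable of (program_name, aborted) pairs. This layout is
    hypothetical; a real scheduler log would need to be mapped onto it.
    observed_days: length of the observation window in days.
    Returns {program_name: days per abort, or None if it never aborted}.
    """
    abort_counts = defaultdict(int)
    programs = set()
    for program, aborted in job_records:
        programs.add(program)
        if aborted:
            abort_counts[program] += 1

    result = {}
    for program in programs:
        count = abort_counts[program]
        # A program with no aborts in the window has no estimate yet.
        result[program] = observed_days / count if count else None
    return result
```

For example, a program that aborted twice in a 365-day window gets an estimate of 182.5 days between aborts, while a program that never aborted maps to `None`.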

1 comments

gabdiax 101 days ago

That's really interesting. Using statistical process control for failure rates in HPC systems sounds like a very solid approach.

In your experience, were there usually early signals in metrics before job failures increased? For example, patterns like latency changes, resource saturation, or network anomalies?

I'm trying to understand whether those signals appear consistently enough to detect issues before incidents actually happen.


zippyman55 100 days ago

For the mean time to failure, I based it on a section of Mastering Statistical Process Control by Tim Stapenhurst, specifically the section on using SPC to measure earthquakes, etc. The system worked pretty well and ran for years. Using R, I built a free system to monitor all the job schedule information for our HPC systems. I’d present the most egregious information in the form of a daily Pareto chart, and I’d attempt to shame the code owners when they appeared at the top of it. But mostly, I just did not want people falling back on their go-to excuse of blaming the system administrators when it was really their recent code update.

There were other SPC charts, where one could drill down and look at job run times, or which nodes the jobs ran on, etc. But working the culture to get people to be responsible for their applications was a little out of my wheelhouse, and always a challenge. The few people who really embraced their application ownership and wanted to make sure things ran well were always nice to deal with. It was always nice to say something like, “your job used to crash 3 times a year and now it seems to be crashing 6 times a year.” At least we would have a good starting point to discuss potential causes. I know some of the developers got sucked into tools like Splunk, but to me, that was always cost prohibitive for our budget and our volume of data.

Answering your question about “early signals in metrics before job failures increased”: the mean time to failure SPC chart would show a job failure signature, and if there were problem nodes or problems with a software update, that would become apparent and allow further investigation. The other SPC charts, like job run time, would show things like increased job run time, etc. But that was pretty basic stuff (and lots of tools can do it), such as a user generating a daily tar file that grew over time and eventually filled up a file system.

But getting people to take action always seemed so hard.
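To make the idea of an SPC chart on time between failures a bit more concrete, here is a minimal Python sketch. It is an assumption-laden illustration, not the system described above (which was written in R and is not shown): it uses a plain individuals (XmR) chart with the standard 2.66 moving-range factor and a run rule of 8 consecutive points below the center line. Time-between-failure data is often skewed, and some treatments transform it first; this sketch omits that. All function names are hypothetical.

```python
from datetime import datetime
from statistics import mean


def days_between_failures(timestamps):
    """Convert failure timestamps into gaps, in days, between consecutive failures."""
    ordered = sorted(timestamps)
    return [(b - a).total_seconds() / 86400.0 for a, b in zip(ordered, ordered[1:])]


def xmr_limits(baseline):
    """Center line and limits for an individuals (XmR) chart.

    baseline: intervals from a period considered normal (at least two values).
    2.66 is the usual XmR factor (3 / 1.128) applied to the mean moving range.
    """
    center = mean(baseline)
    moving_ranges = [abs(b - a) for a, b in zip(baseline, baseline[1:])]
    mr_bar = mean(moving_ranges)
    lower = max(0.0, center - 2.66 * mr_bar)  # an interval cannot be negative
    upper = center + 2.66 * mr_bar
    return center, lower, upper


def find_signals(intervals, center, lower, run_length=8):
    """Flag intervals suggesting failures are arriving faster than the baseline.

    Returns (index, reason) pairs for points below the lower limit and for the
    point that completes a run of `run_length` values below the center line.
    """
    signals = []
    run = 0
    for i, value in enumerate(intervals):
        if value < lower:
            signals.append((i, "below lower limit"))
        run = run + 1 if value < center else 0
        if run == run_length:
            signals.append((i, "run below center line"))
    return signals
```

In this sketch, a job that goes from roughly 120 days between crashes to roughly 60 would show up through the run rule even if no single interval falls below the lower limit, which matches the “3 times a year versus 6 times a year” kind of conversation mentioned above.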


gabdiax 97 days ago

I’m building EventSentinel.ai, a predictive AI platform that monitors hardware and network infrastructure to detect early signals of failures and connectivity issues before they cause downtime.

I’m looking for a few early-stage design partners (SRE / DevOps / IT / Network teams) who:

Manage on‑prem or hybrid infrastructure with critical uptime requirements

Are currently using tools like Datadog, PRTG, Zabbix, or similar, but still deal with “surprise” incidents

Are open to trying an MVP and giving candid feedback in short sessions

What you’d get:

- Early access to our predictive failure and anomaly detection features

- Direct influence on the roadmap based on your needs

- Free usage during the MVP phase (and preferential terms later)

If this sounds relevant, comment “interested” and I’ll follow up with details, or email me at gabriele@eventsentinel.ai
