| HN Mirror

For the mean time to failure, I based it on a section out of Mastering Statistical Process Control, By Tim Stapenhurst. Specifically, The section on using SPC to measure earthquakes, etc. The system worked pretty well, ran for years, and using R, I built a free system to monitor all the job schedule information for our HPC systems. I’d present the most egregious information in the form of a daily Pareto chart. I’d attempt to shame the code owners when they would appear at the top of the Pareto chart. But, mostly, I just did not want people having their go-to excuse of blaming the system administrators, when it was really their recent code update. There were other SPC charts, which one could drill down and look at job run times, or which nodes the jobs ran on, etc. But working the culture to get people to be responsible for their applications was a little out of my wheelhouse, and always a challenge. For those few people who really embraced their application ownership and wanted to make sure things ran well, it that was always nice. It was always nice to say something like, “your job used to crash 3 times a year and now it seems to be crashing 6 times a year.” At least, we would have a good point to discuss potential causes. I know some of the developers got sucked into tools like Splunk, but to me, that was always cost prohibitive for our budget and our volume of data. Answering your question about “early signals in metrics before job failures increased” the mean time to failure SPC chart would show a job failure signature and if there were problem nodes, or problems with a software update, that would become apparent to allow further investigation. The other SPC charts, like job run time would show things like increased job run time, etc. But, that was pretty basic stuff (and lots of tools can do that stuff), such as a user was generating a daily tar-file, which was growing over time and eventually filling up a file system, etc. But getting people to take action always seemed so hard.