Hacker News
by nitwit005 970 days ago
Describe a health check that can detect any possible hardware problem.

The error rate on the machines was higher in both cases, but many requests still succeeded. Amazon certainly didn't detect an issue right away either.


Is there no way you could record metrics, even custom metrics that get populated into CloudWatch via the CloudWatch Logs agent, and, once errors pass a certain threshold, bring another instance up and kill the existing one? If you can detect sporadic errors, there must be some way to automate it.

I’m assuming this isn’t a web server; if it is, it’s even simpler.
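
Roughly what I have in mind, as a minimal sketch rather than anything production-ready. The class name, the window size, the 5% threshold, and the minimum sample count are all made up for illustration, and it deliberately stops short of any AWS call: it only decides whether an instance looks bad enough to replace.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the outcome of the most recent requests on one instance.

    Hypothetical helper: the defaults below are illustrative, not tuned.
    """

    def __init__(self, window=200, threshold=0.05, min_samples=50):
        # Each entry is True if the request failed, False if it succeeded.
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, failed):
        self.outcomes.append(bool(failed))

    def error_rate(self):
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)

    def should_replace(self):
        # Don't act on a handful of requests.
        if len(self.outcomes) < self.min_samples:
            return False
        return self.error_rate() > self.threshold
```

You'd call `record()` for every request and, whenever `should_replace()` returns true, trigger whatever actually launches the new instance and terminates the old one.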

A statistical rule moves you into the realm of deciding what rate of false positives and false negatives you'll tolerate. In this case you'd be basing that decision on data from exactly two incidents, which is obviously a bit fraught.
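
To make the tradeoff concrete, here is a rough sketch under some loud assumptions: requests fail independently, a healthy machine fails at some fixed rate, a faulty one at some higher fixed rate, and the rule is "flag the machine if at least k of the last n requests failed". The function names are hypothetical, and any rates you plug in are guesses, which is exactly the problem when you only have two incidents to estimate them from.

```python
from math import comb


def prob_at_least(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), assuming independent requests."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def rule_error_rates(n, k, healthy_p, faulty_p):
    """Rule: flag the machine if at least k of the last n requests failed.

    Returns (false_positive_rate, false_negative_rate) for that rule.
    """
    # A healthy machine that happens to hit k failures gets flagged.
    false_positive = prob_at_least(k, n, healthy_p)
    # A faulty machine that stays under k failures gets missed.
    false_negative = 1.0 - prob_at_least(k, n, faulty_p)
    return false_positive, false_negative
```

Raising k lowers the false positive rate and raises the false negative rate, so picking k amounts to picking which kind of mistake you'd rather make.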