by nitwit005 966 days ago
None of the layers of abstraction are perfect. You have to deal with the whole mess all the way down.

We've had individual EC2 instances go bad where I currently work, with Amazon acknowledging a hardware problem after a ticket is raised. The reality is, quickly resolving the issue means detecting it and moving off of the physical machine.

Naturally our tooling has no convenient way to do that, because we have layers of things trying to pretend physical machines don't matter.

1 comments

No, the answer is keeping all of your VMs stateless and just using autoscaling with the appropriate health checks, even if you just have a min/max of 1.
Describe a health check that can detect any possible hardware problem.

The error rate on the machines was higher in both cases, but many requests still succeeded. Amazon certainly didn't detect an issue right away either.

Is there no way you could record metrics, even custom metrics that get populated via the CloudWatch logs agent to CloudWatch, and once they cross a certain threshold of errors, bring another instance up and kill the existing instance? If you could detect sporadic errors, there must be some method to automate it.

I’m assuming this isn’t a web server; if it is, it’s even simpler.
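
For illustration, the threshold idea in this comment could be sketched roughly as below. This is only a minimal sketch: `ErrorRateMonitor`, its parameters, and the 5% threshold are hypothetical, and the code only decides whether an instance looks bad. Actually replacing it (for example, by marking it unhealthy so the Auto Scaling group launches a new one) is left out and would depend on your setup.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks recent request outcomes and flags a sustained error rate.

    All names and defaults here are hypothetical, chosen for illustration.
    """

    def __init__(self, window=200, threshold=0.05, min_samples=50):
        # Keep only the most recent `window` outcomes (True = success).
        self.outcomes = deque(maxlen=window)
        self.threshold = threshold
        self.min_samples = min_samples

    def record(self, ok):
        """Record one request outcome."""
        self.outcomes.append(bool(ok))

    def error_rate(self):
        """Fraction of failed requests in the current window."""
        if not self.outcomes:
            return 0.0
        failures = sum(1 for ok in self.outcomes if not ok)
        return failures / len(self.outcomes)

    def should_replace(self):
        """True once enough samples exist and the error rate exceeds the threshold."""
        enough = len(self.outcomes) >= self.min_samples
        return enough and self.error_rate() > self.threshold


monitor = ErrorRateMonitor(window=100, threshold=0.05, min_samples=20)
for i in range(100):
    monitor.record(ok=(i % 10 != 0))  # simulate every tenth request failing
print(monitor.should_replace())
```

The `min_samples` guard is there so a single early failure on a fresh instance does not trigger a replacement.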

A statistical rule moves you into the realm of deciding what rate of false positives and false negatives you'll tolerate. Based on data from exactly two incidents in this case, which is obviously a bit fraught.
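
To make that trade-off concrete, here is a rough sketch that assumes requests fail independently, so the error count in a window is binomial. The function names and every number in it (window size, cutoff, healthy and degraded error rates) are made up for illustration and are not taken from the incidents described; with only two incidents, the degraded rate in particular would be little more than a guess.

```python
from math import comb


def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def rule_error_rates(n, k, p_healthy, p_bad):
    """Error rates for the rule 'replace if >= k of the last n requests failed'.

    Returns (false_positive, false_negative):
      false_positive: a healthy machine trips the rule
      false_negative: a degraded machine does not trip the rule
    """
    false_positive = binom_tail(n, k, p_healthy)
    false_negative = 1.0 - binom_tail(n, k, p_bad)
    return false_positive, false_negative


# Hypothetical numbers: 200-request window, 1% baseline errors, 5% when degraded.
for k in (4, 6, 8, 10):
    fp, fn = rule_error_rates(n=200, k=k, p_healthy=0.01, p_bad=0.05)
    print(f"k={k:2d}  false positive={fp:.4f}  false negative={fn:.4f}")
```

Raising the cutoff `k` lowers the false positive rate and raises the false negative rate, which is the choice the rule forces you to make.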