by ibdknox
4867 days ago
I'm not sure that should be a concern of the routing layer, or even necessarily a concern of Heroku. It's not their job to ensure that your code isn't blowing up. That said, health checks are nice for other reasons and could be used outside of the routing layer (which needs to sail along as quickly as possible).

I don't know how big they are. 50k machines? Could be off by an order of magnitude either way, but I'll go with that. Suppose that your servers have, let's be generous, a 5 year mean time between failures. That's 10k machines dying every year. About 27 per day. A bit over 1 per hour.
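For anyone who wants to check the back-of-envelope numbers, here is the arithmetic as a short Python sketch. The 50k fleet size and the 5 year MTBF are the guesses from the comment, not real Heroku figures, and the variable names are just for illustration.

```python
# Guessed inputs from the comment above, not measured values.
machines = 50_000      # assumed fleet size
mtbf_years = 5         # assumed mean time between failures per machine

# With a constant failure rate, expected failures per year = fleet size / MTBF.
failures_per_year = machines / mtbf_years
failures_per_day = failures_per_year / 365
failures_per_hour = failures_per_day / 24

print(f"{failures_per_year:,.0f} per year")   # 10,000 per year
print(f"{failures_per_day:.1f} per day")      # 27.4 per day
print(f"{failures_per_hour:.2f} per hour")    # 1.14 per hour
```

If the fleet guess is off by an order of magnitude in either direction, the failure rate scales with it: roughly one failure every nine hours at 5k machines, or more than ten per hour at 500k.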
Machines don't necessarily die cleanly. They get flaky. Bits flip. Memory gets corrupted. Network interfaces claim to have sent data they didn't. Every kind of thing that can go wrong, will go wrong, regularly. And every one of them opens you up to following "impossible" code paths where the machine still looks pretty good, but your software did something that should not be possible, and is now in a state that makes no sense. Eventually you figure it out and pull the machine.
Yeah, it doesn't happen to an individual customer too often. But it is always happening to someone, somewhere. And if you use least connections routing, many of those failures will be much, much bigger deals than they would be otherwise. And every time it happens, it was Heroku's fault.
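To make the least connections point concrete, here is a toy simulation. It is a rough sketch under loud assumptions, not a model of Heroku's actual router: one request arrives per tick, healthy servers hold a connection for 20 ticks, and the one broken server fails every request after 1 tick, so it almost always looks like the least loaded machine. The `simulate` function and all of its parameters are made up for illustration.

```python
import random

def simulate(policy, n_servers=10, ticks=2000,
             healthy_time=20, broken_time=1, seed=0):
    """Return the fraction of requests routed to the one broken server."""
    rng = random.Random(seed)
    broken = n_servers - 1                         # assumed: last server fails fast
    finish_times = [[] for _ in range(n_servers)]  # open connections per server
    hits_on_broken = 0
    for now in range(ticks):
        # Close connections whose requests have finished (or errored out).
        finish_times = [[t for t in conns if t > now] for conns in finish_times]
        if policy == "least_connections":
            # Fewest open connections wins; ties go to the lowest index,
            # which works against the broken server in this setup.
            target = min(range(n_servers), key=lambda s: len(finish_times[s]))
        else:
            target = rng.randrange(n_servers)      # uniform random routing
        duration = broken_time if target == broken else healthy_time
        finish_times[target].append(now + duration)
        hits_on_broken += target == broken
    return hits_on_broken / ticks

print("least connections:", simulate("least_connections"))
print("random:           ", simulate("random"))
```

With these made-up parameters the broken server ends up with roughly half of all requests under least connections, versus about the one in ten you would expect from random routing. The exact ratio depends entirely on the assumed service times, but the direction is the point: a machine that fails fast looks idle, and an idle-looking machine attracts traffic.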