| > Have you even run a production service? Yes. For many years I was on the core team responsible for a service important enough in my country that if it's down for 30 minutes, it makes the national newspapers. Some million users. And the main lesson from that experience is: if you are going to fail, make sure you fail as fast as possible. Then the failure happens during work hours and you can usually do a simple rollback (1) to the previous version of the service -- sometimes that rollback will even happen automatically if the failure happens quickly enough. The worst cases and longest downtimes came from performance problems and/or suddenly changing query plans that only crept up on us slowly and perhaps hit during traffic spikes (which in our case would happen during holidays). -- (1) Yes, I know you said that in your example you did something non-reversible in between. But our rollouts would often be done through flags and a percentage of traffic, not so much by code version. Also, in practice, with our traffic volumes, either the failure would come soon enough that you didn't have time to do that other non-reversible thing before you went down, OR, if it happens "seldom", it can just stay down until you are able to roll forward; it is still less disruptive to get the problem right away than to suddenly get it after a year. I guess YMMV. Again, what I'm proposing is an optional hint, so if you don't do gradual rollout of traffic on new features, if you don't have high test coverage, etc., you could simply not use it. But I know for sure it would be useful in our specific context. |