by andrewaylett 1068 days ago
I've happily run a non-critical service, only to discover during an outage (which should have been a non-event) that another team had started relying on it for something business-critical.

This was famously a problem for Google's distributed lock service, Chubby. They handled it by intentionally having outages to flush out ways it might have started to bear loads it wasn't designed for: https://sre.google/sre-book/service-level-objectives/#xref_r...
I'm a fan of the Chaos Monkey (Netflix software) approach to this.
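
For anyone who wants the planned-outage idea in concrete terms, here is a minimal Python sketch: if real downtime in a period has not already pulled availability down to the SLO target, there is budget left to take the service down on purpose. The function names, the 90-day quarter, and the 99.9% target are assumptions for illustration, not anything taken from Google's or Netflix's tooling.

```python
def remaining_downtime_budget(period_seconds, downtime_seconds, slo_target):
    """Seconds of downtime still allowed this period under the SLO (never negative)."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    allowed = period_seconds * (1.0 - slo_target)
    return max(0.0, allowed - downtime_seconds)


def should_inject_outage(period_seconds, downtime_seconds, slo_target):
    """True when observed availability is still above the SLO target,
    i.e. the service has been more reliable than promised."""
    if period_seconds <= 0:
        raise ValueError("period_seconds must be positive")
    availability = 1.0 - downtime_seconds / period_seconds
    return availability > slo_target


# Hypothetical numbers: a 90-day quarter and a 99.9% availability target.
QUARTER_SECONDS = 90 * 24 * 60 * 60
SLO_TARGET = 0.999

if should_inject_outage(QUARTER_SECONDS, downtime_seconds=0, slo_target=SLO_TARGET):
    budget = remaining_downtime_budget(QUARTER_SECONDS, 0, SLO_TARGET)
    print(f"Room for a controlled outage of up to {budget:.0f} seconds")
```

The point of the check is to stop consumers from assuming more availability than the service actually promises, so hidden dependencies show up during a scheduled test rather than a real incident.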

You can't expect your platform to be reliable if it just breaks at random.