| Great post. This one always brings a smile to my face: "Every component is crash-only." I was part of the team that developed a distributed, five-nines control system for an industry where downtime costs millions per minute and, if it lasts long enough, comes with a federal investigation. On top of that, the industry is made up of competitors that explicitly distrust each other, so all components had to be truly distributed, with no central coordination for anything. Given the requirements, we decided to explicitly adopt a crash-only approach. Between idempotent operations, horizontal scaling, and fast restart times, we could keep failing components from impacting SLAs (and we had testing to ensure it). Once it got out into the field (which, given how risk-averse this industry is, took years), it turned out they really did not like software crashing. They interpreted crashing as bad quality, and no amount of "we do it on purpose to ensure correctness" was going to make them happy. |