by jegoodwin3 3393 days ago
I agree testing and automation are good. I think they need to go beyond this to formal verification for something at this scale and with these reliability requirements. NASA doesn't make these sorts of mistakes.

By the way - this is not just Amazon's problem now. We know the internet has a single point of failure. So does a lot of IoT.

When will we experience the first Suicide DevOps?

3 comments

https://www.youtube.com/watch?v=6OalIW1yL-k

(Specifically https://www.youtube.com/watch?v=6OalIW1yL-k#t=3m but it's worth watching the whole clip (or even the whole movie) if you haven't seen it before. It's from Terry Gilliam's "Brazil".)

Almost twenty years ago, though.
Well, they've had plenty of opportunities to learn from their mistakes; Amazon hasn't had this long.
>We know the internet has a single point of failure.

It has? I have yet to see the day where I can reach neither my email provider nor Google nor Hacker News. My local provider might screw up occasionally, or some number of websites go unreachable for whatever reason. But I fail to come up with anything short of cutting multiple sea cables that causes more than 50% of servers to be unreachable to more than 50% of users.

Amazon do formally verify AWS (they use TLA+), which is probably why this failure is a human error. Of course, you could expand the formal analysis of the system to include all possible operator interactions, but you'll need to draw the line at some point. NASA certainly makes human errors that result in catastrophic failures. The Challenger disaster was also a result of human error to a large degree[1]; to quote Wikipedia: "The Rogers Commission found NASA's organizational culture and decision-making processes had been key contributing factors to the accident, with the agency violating its own safety rules."

[1]: https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disas...