|
Complex systems are really, really hard. I'm not a big fan of seeing all these folks bash AWS for this without really understanding the complexity or nastiness of situations like this. Running the kind of services they do, for the kind of customers they have, is a VERY hard problem. We ran into a very similar issue at the database layer in our company literally 2 weeks ago: connections to our MySQL exploded, completely took down our data tier, and caused a multi-hour outage, compounded by retries and thundering herds. Understanding a problem like this under stress is extremely difficult and a harrowing experience. Anticipating this kind of issue is very, very tricky. Naive responses to this include "better testing", "we should be able to do this", "why is there no observability", etc. The problem isn't testing. Complex systems behave in complex ways, and it's difficult to model and predict them, especially when the inputs to the system aren't entirely under your control. Individual components are easy to understand, but when you integrate them, things get out of whack. I can't stress enough how difficult it is to model or even think about these systems. They're very, very hard. Add to this that the knowledge is distributed among many people: you're dealing with not only distributed systems but also distributed people, which makes it even harder to wrap your head around. Outrage is the easy response. Empathy and learning are the valuable ones. Hugs to the AWS team, and good learnings for everyone. |
I'm outraged that AWS, as a company policy, continues to lie about the status of their systems during outages, making it hard for me to communicate to my stakeholders.
Empathy? For AWS? AWS is part of a mega-corporation that is closing in on 2 TRILLION dollars in market cap. It's not a person. I can empathize with individuals who work for AWS, but it's weird to ask us to have empathy for a massive, faceless, ruthless, relentless, multinational juggernaut.