| Running out of file handles and other IO limits is embarrassing and happens at every company, but I’m surprised that AWS was not monitoring this. I’m also surprised at the general architecture of Kinesis. What appears to be their own hand rolled gossip protocol (that is clearly terrible compared to raft or paxos, a thread per cluster member? Everyone talking to everyone? An hour to reach consensus?) and the front end servers being stateful period breaks a lot of good design choices. The problem with growing as fast as Amazon has is that their talent bar couldn’t keep up. I can’t imagine this design being okay 10 years ago when I was there. |
I see where you're coming from with this, but you really have to wonder. It sounds more like the original architects made implicit assumptions regarding scale that, likely due to original architects and engineers moving on, were not re-evaluated by the current engineers on Kinesis as Kinesis grew. While it may take an hour now for the front-end cache to sync, I find it highly unlikely that it needed that much time when Kinesis first launched.
The process failure here is organizational, one where Amazon failed to audit its current systems in a complete and current manner such that sufficient attention and resources could be paid to a re-architecture of a critical service before it caused the service to fail. Even now, vertically scaling the front-end cache fleet is just a band-aid - eventually, that won't be possible anymore. Sadly, the postmortem doesn't seem to identify the organizational failure that was the true root cause of the outage.