|
|
|
|
|
by stefan_bobev
240 days ago
|
|
I appreciate the details this went through, especially laying out the exact timelines of operations and how overlaying those timelines produces unexpected effects. One of my all time favourite bits about distributed systems comes from the (legendary) talk at GDC - I Shot You First[1] - where the speaker describes drawing sequence diagrams with tilted arrows to represent the flow of time and asking "Where is the lag?". This method has saved me many times, all throughout my career from making games, to livestream and VoD services to now fintech. Always account for the flow of time when doing a distributed operation - time's arrow always marches forward, your systems might not. But the stale read didn't scare me nearly as much as this quote: > Since this situation had no established operational recovery procedure, engineers took care in attempting to resolve the issue with DWFM without causing further issues Everyone can make a distributed system mistake (these things are hard). But I did not expect something as core as the service managing the leases on the physical EC2 nodes to not have recovery procedure. Maybe I am reading too much into it, maybe what they meant was that they didn't have a recovery procedure for "this exact" set of circumstances, but it is a little worrying even if that were the case. EC2 is one of the original services in AWS. At this point I expect it to be so battle hardened that very few edge cases would not have been identified. It seems that the EC2 failure was more impactful in a way, as it cascaded to more and more services (like the NLB and Lambda) and took more time to fully recover. I'd be interested to know what gets put in place there to make it even more resilient. [1] https://youtu.be/h47zZrqjgLc?t=1587 |
|