|
The root comment asked if I'd been part of an org scaling orders of magnitude quickly, so I'll actually answer it: Venda at Christmas peak (pre-cloud, hardware on 4 month lead times, ~1% of global web traffic at peak) and The Division at launch (new IP, day-zero always-online AAA, ops team of 2). Different shapes, same playbook, both worked. So with the credentialing question out of the way.. GitHub's own April post-mortem names the causes in their own words: tight coupling allowing localised failures to cascade, and inability to shed load from misbehaving clients. Their March report says one of the March outages "shared the same underlying cause" as a February one - i.e. they hit the same rake twice in two months. Cascade isolation has a dedicated chapter in the SRE book from 2016. Load shedding is older than that; the Erlang/OTP people were writing about it in the 80s. This isn't research territory, it's a syllabus, and GitHub is fumbling it with Microsoft's chequebook behind them. Amazon and Blizzard aren't the slam-dunk examples you want them to be either. Prime Day 2018 fell over because their auto-scaling failed and they had to manually add servers - that's not "well-known by now", that's a company at literal planetary scale getting caught short on the one day of the year it was guaranteed to matter. And Blizzard's Lord of Hatred launch this week is doing the exact same login-queue routine Diablo's done at every launch in living memory. If those are your "two decades of solved problems", the bar is on the floor. Your 100x rearchitecture story actually argues my position, by the way. You described tight coupling causing cascading failures across services, and the fix was to decouple. That is the boring operational discipline I'm saying has atrophied - you and your team did the work. The point is GitHub, a decade later, with Microsoft's resources and thirty times the headcount, is putting out post-mortems that read like undergraduate distributed systems coursework. So no - the question isn't whether GitHub's problem is hard. Every scaling problem looks hard from inside. The question is whether the operational discipline that solved this class of problem in the 2000s and 2010s is still being practised, or whether the industry has quietly decided "it's complicated" is sufficient cover. |
1) The general cause of issues in these cases is that certain assumptions no longer hold, and above a certain level of complexity, there are too many assumptions to keep track of, and so things fail in surprising ways. Like, the need for auto-scaling was well-known and Amazon did have that solution in place. But I recall the 2018 Prime Day was record-breaking, so it is likely the very same auto-scaling service that was supposed to save them fell over because they forecast too conservatively! (As an aside, I follow a senior AMZN engineer who's made his career out of load-testing their services, and he has many fun war stories.)
2) The resiliency work is not done upfront because it is additional complexity that may not be needed. "You're not Google" and YAGNI is sound advice most of the times. So the system is designed with some "reasonable" assumptions (which... see above!) At larger companies, resiliency mechanisms (load-shedding etc.) are built into standard components, but then...
3) Different performance profiles require different resiliency mechanisms, and it's not always clear what they would be.
Going back to the example of the 3rd party API service, when we inherited it around ~2012, it was built on standard infrastructure components with in-built resiliency mechanisms... but those were designed for internal services with latencies expected in milliseconds, whereas our downstream calls could go into seconds or even minutes. Still, with the traffic then, with a little tuning it worked fine and served the company well... until we (or the 3rd party APIs!) hit a certain scale and started seeing issues. At this point we extrapolated the trends, benchmarked heavily, and re-architected. And then we hit new scales and new use-cases that surfaced new issues, so we had to re-architect again!
The point is, the system's performance profile was very different from typical web services (the primary culprits being extremely high variance in downstream characteristics and very non-linear growth) and it was non-obvious to scale with conventional wisdom. I do not know what's happening at GitHub, but I suspect they have some similarly unique performance aspects.