by keeda 49 days ago

Agreed, the techniques in general (caching, backpressure, exponential backoffs, etc.) are well-known, but a couple of things:

1) The general cause of issues in these cases is that certain assumptions no longer hold, and above a certain level of complexity, there are too many assumptions to keep track of, and so things fail in surprising ways. Like, the need for auto-scaling was well-known and Amazon did have that solution in place. But I recall the 2018 Prime Day was record-breaking, so it is likely the very same auto-scaling service that was supposed to save them fell over because they forecast too conservatively! (As an aside, I follow a senior AMZN engineer who's made his career out of load-testing their services, and he has many fun war stories.)

2) The resiliency work is not done upfront because it is additional complexity that may not be needed. "You're not Google" and YAGNI are sound advice most of the time. So the system is designed with some "reasonable" assumptions (which... see above!). At larger companies, resiliency mechanisms (load-shedding etc.) are built into standard components, but then...

3) Different performance profiles require different resiliency mechanisms, and it's not always clear what they would be.

Going back to the example of the 3rd party API service, when we inherited it around 2012, it was built on standard infrastructure components with built-in resiliency mechanisms... but those were designed for internal services with latencies expected in milliseconds, whereas our downstream calls could take seconds or even minutes. Still, at the traffic levels of the time, it worked fine with a little tuning and served the company well... until we (or the 3rd party APIs!) hit a certain scale and started seeing issues. At that point we extrapolated the trends, benchmarked heavily, and re-architected. And then we hit new scales and new use-cases that surfaced new issues, so we had to re-architect again!
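
To put rough numbers on that latency mismatch, here is a minimal sketch using Little's Law (average requests in flight = arrival rate x mean latency). The request rate and latencies below are hypothetical, picked only for illustration; they are not figures from the service described above.

```python
def in_flight_requests(arrival_rate_per_s: float, mean_latency_s: float) -> float:
    """Little's Law: average requests in flight = arrival rate x mean latency."""
    return arrival_rate_per_s * mean_latency_s


RATE = 200.0  # hypothetical steady request rate, in requests per second

# Hypothetical latency profiles: one internal call, two slow downstream calls.
PROFILES = [
    ("internal service, 20 ms", 0.020),
    ("3rd party API, 5 s", 5.0),
    ("3rd party API, 2 min", 120.0),
]

for label, latency_s in PROFILES:
    print(f"{label}: ~{in_flight_requests(RATE, latency_s):.0f} requests in flight")
```

At the same request rate, the slow downstream calls keep hundreds or thousands of times more requests open at once, so thread pools, connection limits, queue depths and timeouts sized for millisecond calls are likely to be exhausted long before the request rate itself looks alarming.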

The point is, the system's performance profile was very different from that of typical web services (the primary culprits being extremely high variance in downstream characteristics and very non-linear growth), and it was not obvious how to scale it with conventional wisdom. I do not know what's happening at GitHub, but I suspect they have some similarly unique performance aspects.