

by keeda 49 days ago

It really, really depends on what you mean. Specifically, it depends on the application and its various compute, I/O and access patterns. Scaling ecommerce and games is well-known by now (e.g. Amazon and Blizzard have been dealing with insane scale for two decades now.) However, anything outside a well-known pattern can be very tricky to scale.

I once worked on a team that had to 100x scale a system whose downstream dependencies were various 3rd party APIs and data sources, most of which had no real SLAs to speak of and had extremely high variance in latencies and data transfer patterns. This basically required rearchitecting everything including our clients because the typical transactional request/response access pattern was too tightly coupled, and any hiccup in an external API quickly rippled up through the call-tree and caused outages 3+ services removed from ours. In some cases, the re-architecting went all the way to the UI.
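To make the "hiccup ripples up the call-tree" failure mode concrete, here is a minimal sketch of one common mitigation, a circuit breaker. It is not what the team above built (the comment does not say); the class name, thresholds and injectable clock are all hypothetical, chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker refuses to call a failing dependency."""


class CircuitBreaker:
    """Stops calling a dependency after repeated failures (illustrative only)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self._clock = clock                         # injectable for testing
        self._failures = 0
        self._opened_at = None                      # None means "closed"

    def call(self, fn):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after:
                # Fail fast instead of tying up a caller on a sick dependency.
                raise CircuitOpenError("dependency temporarily disabled")
            # Cool-down elapsed: allow one trial call ("half-open").
            # A single further failure will re-open the breaker.
            self._opened_at = None
            self._failures = self.failure_threshold - 1
        try:
            result = fn()
        except Exception:
            self._failures += 1
            if self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        self._failures = 0  # any success resets the count
        return result
```

Callers wrap each downstream call in `breaker.call(...)` and treat `CircuitOpenError` as "degrade gracefully now" rather than waiting on a timeout, which is what keeps one slow API from stalling services further up the tree.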

Years later, I led a company-wide effort to optimize our entire user-facing application infrastructure to not fall over from sharply spiking user traffic, touching dozens of services across dozens of teams. We did a brief study and realized there was no single common recommendation (like "tune your caches") we could give that would help all the teams because each one had very different resource usage patterns and hence different bottlenecks. Our approach was basically to farm the task out to each team and say "here are some common metrics to look into and some common issues to look for and some common solutions, get back to us if you need help." We spent a lot of time on the help.

I have no idea what the patterns for GitHub are, but I'll note it's much more than just a DB and it has a dependency (Actions) with extremely high variance in latencies and resource usage.


keeda 49 days ago

I wrote this in response to the below comment, which is now edited and unfortunately dead, so posting here:

I understand, that wasn't a comment on your efforts back then, just that it is a solved problem today. But that does not mean other scaling problems are comparable or comparably solved. The universe of scaling problems is immense!

Worse, different problems occur at different scales. In the 3rd party API system, years after the first re-architecting, some use-cases developed issues at scale that exceeded the already high operational parameters we benchmarked at, and required us to re-architect the service again, including building out a whole new cluster so we could isolate that traffic entirely.

It is really hard to predict how things will break until they do.

(As an aside, I remember reading a lot of interesting things about Blizzard's technology, even if Blizzard didn't publish those themselves. There were many people who researched their products and published their findings. For instance, someone analyzed Wireshark traces and published a very detailed report about how they tuned their server-side networking stack. One thing that stood out was Blizzard used TCP for WoW, whereas the conventional wisdom was UDP for real-time multiplayer!)


dijit 49 days ago

We used TCP for The Division. This was a major mistake, and I don't think it is something people should repeat.

For example, if you have TCP_NODELAY and a few thousand players, you'll be swimming in about 1.2M packets per second pretty quickly.
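As a rough back-of-envelope sketch of how a figure like that can arise: with TCP_NODELAY, Nagle's algorithm is off, so each small write tends to go out as its own segment, and each data segment can draw a pure ACK in the other direction. Every input below (5,000 players, 60 messages per second each way, one ACK per data packet) is an illustrative assumption, not a number from The Division.

```python
def packets_per_second(players, sends_per_second, recvs_per_second,
                       acks_per_data_packet=1.0):
    """Estimate total packets/s seen at the server edge.

    acks_per_data_packet=1.0 is the worst case (every segment ACKed on its
    own); delayed ACKs would typically bring this closer to 0.5.
    """
    data_packets = players * (sends_per_second + recvs_per_second)
    return round(data_packets * (1 + acks_per_data_packet))


# Hypothetical inputs chosen only to show the order of magnitude.
estimate = packets_per_second(players=5000, sends_per_second=60,
                              recvs_per_second=60)
print(estimate)  # 1200000
```

The point of the arithmetic is only that the ACK traffic roughly doubles the packet count a UDP design would have produced for the same message rate.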

This is enough to completely crush any stateful firewalls (UDP would pass through because no need to check state), so we had to do ACLs in network hardware instead, and append a magic number so that we could prevent flooding.

Another thing we found was that Windows networking activity only happens on Core0 (Windows 2012 R2), and that at 1.2M PPS the driver crashes.

Logging in to a Windows machine which is AD connected when its network interface is dead is not ideal.

So, yeah, avoid TCP.


keeda 49 days ago

Makes sense, and that was the surprising thing about WoW using TCP. I wonder if Blizzard chose to put in all that extra effort to make TCP work because they encountered enough crappy home routers out there that mangled any non-TCP traffic...


dijit 49 days ago

The root comment asked if I'd been part of an org scaling orders of magnitude quickly, so I'll actually answer it: Venda at Christmas peak (pre-cloud, hardware on 4-month lead times, ~1% of global web traffic at peak) and The Division at launch (new IP, day-zero always-online AAA, ops team of 2). Different shapes, same playbook, both worked. So with the credentialing question out of the way...

GitHub's own April post-mortem names the causes in their own words: tight coupling allowing localised failures to cascade, and inability to shed load from misbehaving clients. Their March report says one of the March outages "shared the same underlying cause" as a February one - i.e. they hit the same rake twice in two months. Cascade isolation has a dedicated chapter in the SRE book from 2016. Load shedding is older than that; the Erlang/OTP people were writing about it in the 80s. This isn't research territory, it's a syllabus, and GitHub is fumbling it with Microsoft's chequebook behind them.
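For readers who haven't met the term, a minimal sketch of load shedding: cap the number of in-flight requests and reject the excess immediately rather than queueing it. This is a generic illustration, not GitHub's or anyone's actual implementation; `LoadShedder`, `handle` and the 503 convention are assumptions made for the example.

```python
import threading


class LoadShedder:
    """Rejects work once a fixed number of requests are in flight."""

    def __init__(self, max_in_flight):
        if max_in_flight < 1:
            raise ValueError("max_in_flight must be at least 1")
        self._max = max_in_flight
        self._in_flight = 0
        self._lock = threading.Lock()
        self.rejected = 0  # simple counter for monitoring

    def try_acquire(self):
        with self._lock:
            if self._in_flight >= self._max:
                self.rejected += 1
                return False  # shed: caller should fail fast
            self._in_flight += 1
            return True

    def release(self):
        with self._lock:
            if self._in_flight > 0:
                self._in_flight -= 1


def handle(shedder, work):
    """Run work() if capacity allows, else answer with a 503-style status."""
    if not shedder.try_acquire():
        return 503, None
    try:
        return 200, work()
    finally:
        shedder.release()
```

Rejecting early is the whole trick: a cheap refusal costs far less than a request that is accepted, queued, and then times out after consuming memory and connections.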

Amazon and Blizzard aren't the slam-dunk examples you want them to be either. Prime Day 2018 fell over because their auto-scaling failed and they had to manually add servers - that's not "well-known by now", that's a company at literal planetary scale getting caught short on the one day of the year it was guaranteed to matter. And Blizzard's Lord of Hatred launch this week is doing the exact same login-queue routine Diablo's done at every launch in living memory. If those are your "two decades of solved problems", the bar is on the floor.

Your 100x rearchitecture story actually argues my position, by the way. You described tight coupling causing cascading failures across services, and the fix was to decouple. That is the boring operational discipline I'm saying has atrophied - you and your team did the work. The point is GitHub, a decade later, with Microsoft's resources and thirty times the headcount, is putting out post-mortems that read like undergraduate distributed systems coursework.

So no - the question isn't whether GitHub's problem is hard. Every scaling problem looks hard from inside. The question is whether the operational discipline that solved this class of problem in the 2000s and 2010s is still being practised, or whether the industry has quietly decided "it's complicated" is sufficient cover.


keeda 49 days ago

Agreed, the techniques in general (caching, backpressure, exponential backoffs, etc.) are well-known, but a couple of things:

1) The general cause of issues in these cases is that certain assumptions no longer hold, and above a certain level of complexity, there are too many assumptions to keep track of, and so things fail in surprising ways. Like, the need for auto-scaling was well-known and Amazon did have that solution in place. But I recall the 2018 Prime Day was record-breaking, so it is likely the very same auto-scaling service that was supposed to save them fell over because they forecast too conservatively! (As an aside, I follow a senior AMZN engineer who's made his career out of load-testing their services, and he has many fun war stories.)

2) The resiliency work is not done upfront because it is additional complexity that may not be needed. "You're not Google" and YAGNI are sound advice most of the time. So the system is designed with some "reasonable" assumptions (which... see above!) At larger companies, resiliency mechanisms (load-shedding etc.) are built into standard components, but then...

3) Different performance profiles require different resiliency mechanisms, and it's not always clear what they would be.

Going back to the example of the 3rd party API service, when we inherited it around 2012, it was built on standard infrastructure components with in-built resiliency mechanisms... but those were designed for internal services with latencies expected in milliseconds, whereas our downstream calls could go into seconds or even minutes. Still, with the traffic then, with a little tuning it worked fine and served the company well... until we (or the 3rd party APIs!) hit a certain scale and started seeing issues. At this point we extrapolated the trends, benchmarked heavily, and re-architected. And then we hit new scales and new use-cases that surfaced new issues, so we had to re-architect again!

The point is, the system's performance profile was very different from typical web services (the primary culprits being extremely high variance in downstream characteristics and very non-linear growth) and it was non-obvious to scale with conventional wisdom. I do not know what's happening at GitHub, but I suspect they have some similarly unique performance aspects.
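To illustrate one of the well-known techniques named at the top of this comment, here is a minimal sketch of exponential backoff with "full jitter" (sleep a random time between zero and an exponentially growing ceiling). The defaults are assumptions; note that `base` and `cap` are exactly the kind of parameters that would need retuning if a downstream call takes seconds or minutes rather than milliseconds.

```python
import random
import time


def call_with_retries(fn, attempts=5, base=0.5, cap=60.0,
                      sleep=time.sleep, rng=None):
    """Call fn(), retrying on any exception with jittered exponential backoff.

    base/cap are in seconds. sleep and rng are injectable so the behaviour
    can be tested without real waiting. All defaults are illustrative.
    """
    rng = rng or random.Random()
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Ceiling doubles each attempt, bounded by cap.
            ceiling = min(cap, base * (2 ** attempt))
            # Full jitter spreads retries out so clients do not retry in sync.
            sleep(rng.uniform(0, ceiling))
```

Without the jitter, many clients that failed at the same moment would all retry at the same moment too, which tends to recreate the spike that caused the failure.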
