by notacoward
1343 days ago
Spot on, and thank you. My second team was ~40 (might have peaked at ~50) split across four sub-teams, for software that ran at similar scale and was designed and developed to rely heavily on other in-house infra. Maybe half a dozen people on adjacent teams (including customers) who had more than trivial knowledge of our system. Some in our team were almost pure developers, some were almost pure operators, most were at various points in between.

I think the reason you and I (we know each other on Twitter BTW) are so at odds with some of the other commenters is that they haven't maxed out on automation yet and don't realize that's A Thing. Automation is absolutely fantastic and essential for running anything at this scale, but it's no panacea. While it usually helps you get more work done faster, sometimes it causes damage faster. Some of our most memorable incidents involved automation run amok, like suddenly taking down 1000 machines in a cluster for trivial reasons or even false alarms while we were already fighting potential-data-loss or load-storm problems. That, in turn, was largely the result of the teams responsible for that infra only thinking about ephemeral web workers and caches, hardly even trying to understand the concerns of permanent data storage. But I digress.

The point, still, is that when you've maxed out on what automation can do for you, your remaining workload tends to scale by cluster. And having thousands instead of dozens of clusters sounds like a nightmare.

There are many ways to scale such systems. Increasing the size of individual clusters sure as hell ain't easy - I joined that team because it seemed like a hard and fun challenge - but ultimately it pays off by avoiding the operational challenge of Too Many Clusters.