|
The biggest general problem is that people thought of shipping the site the same way they thought of shipping an aircraft carrier: you write the code, hand it over, and you're done. It wasn't treated as a running service. So, for example, when the site went down, there wasn't a group of people responsible for bringing it back up. That's what Mikey and the other Site Reliability Engineers fixed. They set up a war room with an engineer from each and every subcontractor, and the war room had three rules: Rule 1: "The war room and the meetings are for solving problems. There are plenty of other venues where people devote their creative energies to shifting blame." Rule 2: "The ones who should be doing the talking are the people who know the most about an issue, not the ones with the highest rank. If anyone finds themselves sitting passively while managers and executives talk over them with less accurate information, we have gone off the rails, and I would like to know about it." Rule 3: "We need to stay focused on the most urgent issues, like things that will hurt us in the next 24-48 hours." Once you have that process working, it's the same as optimizing software: you find the current bottleneck, fix it, find the next one, and repeat. The Time article mentions two such bottlenecks: the lack of DB caching and the bad ID generator. There were dozens of things like that. And there still are! |
So, the same problem every vendor encounters on a software project. I cannot tell you how many "lay people" simply can't grok this.