by keithwarren
4491 days ago
Aside from the rhetoric, the political opinions, and the people firing at you now over the 5-10K thing, I do have an actual question of substance. When things melted down so drastically, my assumption was that the key problem was integration with the various vendors. Yes, we could all look at the HTML itself and make assumptions about poor practices, but given that you are on the inside: what was the biggest issue?

That's what Mikey and the other Site Reliability Engineers fixed. They set up a war room with an engineer from each and every subcontractor, and the war room had three rules:
Rule 1: "The war room and the meetings are for solving problems. There are plenty of other venues where people devote their creative energies to shifting blame."
Rule 2: "The ones who should be doing the talking are the people who know the most about an issue, not the ones with the highest rank. If anyone finds themselves sitting passively while managers and executives talk over them with less accurate information, we have gone off the rails, and I would like to know about it."
Rule 3: "We need to stay focused on the most urgent issues, like things that will hurt us in the next 24-48 hours."
Once you have that process working, it's the same as optimizing software: you find the current bottleneck, fix it, find the next, etc. The Time article mentions two -- the lack of DB caching and the bad ID generator. There were dozens of things like that. And still are!