by brandonb 4491 days ago

The biggest general problem is that people thought of shipping the site the same way they thought of shipping an aircraft carrier -- you write the code, hand it over, and you're done. It wasn't treated as a running service. So, for example, when the site went down, there wasn't a group of people responsible for bringing it back up.

That's what Mikey and the other Site Reliability Engineers fixed. They set up a war room with an engineer from each and every subcontractor, and the war room had three rules:

Rule 1: "The war room and the meetings are for solving problems. There are plenty of other venues where people devote their creative energies to shifting blame."

Rule 2: "The ones who should be doing the talking are the people who know the most about an issue, not the ones with the highest rank. If anyone finds themselves sitting passively while managers and executives talk over them with less accurate information, we have gone off the rails, and I would like to know about it."

Rule 3: "We need to stay focused on the most urgent issues, like things that will hurt us in the next 24-48 hours."

Once you have that process working, it's the same as optimizing software: you find the current bottleneck, fix it, find the next, etc. The Time article mentions two -- the lack of DB caching and the bad ID generator. There were dozens of things like that. And still are!
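To make the caching point concrete, here is a minimal sketch of a read-through cache placed in front of a slow lookup. It is an illustration only, not the site's actual fix: `ReadThroughCache`, `slow_plan_lookup`, the `plan-42` key, and the 30-second TTL are all hypothetical names and values chosen for the example.

```python
import time
from typing import Callable, Dict, Tuple


class ReadThroughCache:
    """Tiny in-memory read-through cache with a time-to-live (TTL)."""

    def __init__(self, loader: Callable[[str], str], ttl_seconds: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader      # hypothetical slow lookup, e.g. a DB query
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}
        self.hits = 0
        self.misses = 0

    def get(self, key: str) -> str:
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            self.hits += 1
            return entry[1]
        # Miss or expired: go to the backing store and remember the answer.
        self.misses += 1
        value = self._loader(key)
        self._entries[key] = (now, value)
        return value


db_calls = []


def slow_plan_lookup(plan_id: str) -> str:
    """Stand-in for a database query; records each call so we can count them."""
    db_calls.append(plan_id)
    return f"plan-details-for-{plan_id}"


cache = ReadThroughCache(slow_plan_lookup, ttl_seconds=30.0)
for _ in range(1000):
    cache.get("plan-42")

# 1000 requests for the same key reach the "database" only once.
print(len(db_calls), cache.hits, cache.misses)  # prints: 1 999 1
```

The trade-off is staleness: a cached answer can be up to one TTL old, so this kind of cache suits data that changes rarely relative to how often it is read.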

3 comments

mbesto 4491 days ago

> The biggest general problem is that people thought of shipping the site the same way they thought of shipping an aircraft carrier -- you write the code, hand it over, and you're done.

So, the same problem every vendor encounters when doing a software project. I cannot tell you how many "lay people" simply can't grok this.

Perdition 4491 days ago

It's amusing because that isn't the way aircraft carriers are made. Something like a tank might just get handed over after only manufacturer testing, but big-ticket items like aircraft carriers go through a year or more of acceptance trials and testing.

They don't finish the last coat of paint, load an air wing, and send it out on deployment.

twistedpair 4491 days ago

Amen. Just watch any documentary about building an aircraft carrier. The systems integration phase is the longest part!

liyanchang 4490 days ago

The difference here is that they had a very public deadline.

keyhole_downs 4490 days ago

The difference here is that politicians treat technology the way voters treat elections.

keithwarren 4491 days ago

What is the biggest issue right now architecturally?

brandonb 4491 days ago

Complexity. We have 10x more code, 10x more components, and 10x more layers than we need. If the initial architecture had been dirt-simple, I don't think the site would have had so much trouble scaling or staying up. But, of course, removing complexity without breaking things takes longer than adding it in the first place.

The other big problem is operations: many steps are done manually that should be automated with tools like Chef or Puppet. When you have a lot of manual steps in your deploy process, it makes the whole system harder to scale, test, modify, and keep running. "Devops" has become a buzzword, but it's definitely needed here.
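As a rough illustration of what replacing manual steps buys you, the sketch below models the check-then-apply idea that configuration management tools are built around: each step states how to tell whether it is already done, so running the whole list twice is safe. This is not Chef's or Puppet's API; `Step`, `converge`, and the dict standing in for a server are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Step:
    name: str
    is_done: Callable[[Dict[str, str]], bool]   # check the current state
    apply: Callable[[Dict[str, str]], None]     # change state only if needed


def converge(steps: List[Step], host: Dict[str, str]) -> List[str]:
    """Run only the steps whose check fails; return the names of steps that ran."""
    changed = []
    for step in steps:
        if not step.is_done(host):
            step.apply(host)
            changed.append(step.name)
    return changed


# A dict stands in for a server's state; real tools inspect the actual machine.
steps = [
    Step("install app v2",
         is_done=lambda h: h.get("app_version") == "2",
         apply=lambda h: h.update(app_version="2")),
    Step("enable cache",
         is_done=lambda h: h.get("cache") == "on",
         apply=lambda h: h.update(cache="on")),
]

host = {"app_version": "1"}
first_run = converge(steps, host)    # both steps run
second_run = converge(steps, host)   # nothing left to do

print(first_run, second_run)  # prints: ['install app v2', 'enable cache'] []
```

Because the second run is a no-op, the same script can be applied to every server, fresh or half-configured, which is what makes a deploy repeatable and testable.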

keithwarren 4491 days ago

Was any consideration given to doing a wholesale rewrite and swapping it in later? It seems that if things are overly complex now, it may be easier to hold the site together with band-aids while a true long-term build is put together. What is going on behind the scenes with respect to this kind of planning?

keyhole_downs 4490 days ago

Better call Scott!

grecy 4491 days ago

My god, I wish my company would institute Rule 2!
