| HN Mirror


	by tonylxc 3830 days ago
	TL;DR: "We don’t believe it is possible to fully prevent the events that resulted in a large part of our infrastructure losing power, ..." This doesn't sound very good.

4 comments

jpatokal 3830 days ago

No, it sounds good, because it's realistic and then you can build mitigation strategies.

I was recently involved in an outage that occurred because the same datacenter was hit by lightning three times in a row. Everything was redundant up the wazoo and handled the first two hits just fine, but by the time the power went out for the third time within N minutes, there wasn't enough juice left in some of the batteries!

Now would it be possible to build an automated system that can withstand this? Probably. But would your time & money be better spent worrying about other failure modes? Almost certainly.
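To make the battery arithmetic concrete, here is a rough sketch of how repeated outages can drain a UPS that would survive any single one. All the numbers (15 minutes of runtime, 6-minute outages, 2-minute gaps, the recharge rate) and the function name are made up for illustration, and the model ignores generators entirely.

```python
def survives_outages(capacity_min, outages, recharge_rate):
    """Return True if the battery rides through every outage in sequence.

    capacity_min:  runtime in minutes at full charge (hypothetical figure)
    outages:       list of (outage_minutes, gap_minutes_until_next_outage)
    recharge_rate: runtime minutes regained per minute back on utility power
    """
    charge = capacity_min
    for outage_min, gap_min in outages:
        charge -= outage_min          # battery carries the load during the outage
        if charge < 0:
            return False              # ran dry before power came back
        # Partial recharge during the gap, capped at full capacity.
        charge = min(capacity_min, charge + gap_min * recharge_rate)
    return True


two_hits = [(6, 2), (6, 2)]
three_hits = [(6, 2), (6, 2), (6, 2)]
print(survives_outages(15, two_hits, 0.1))
print(survives_outages(15, three_hits, 0.1))
```

With these assumed numbers the battery handles two hits but not the third, because recharging is far slower than discharging.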

jrockway 3830 days ago

If your plan to avoid downtime is to prevent power outages, you're going to have downtime. All their sentence says is they can't prevent power outages. That's fine, because the other 1/nth of your servers are on a different power grid in a different state.

tonylxc 3830 days ago

I totally share the view that the best way to deal with failure is to embrace it and cope with it.

It is true that their sentence is all about recovery; however, it is disappointing that they didn't mention anything about a redundant datacenter.

otterley 3830 days ago

Whose datacenter are they in? This is the second time in less than two weeks that they've suffered a power-related issue. My company is in 4 different sites around the world and we've never lost power ever - and, if one circuit did go out, we'd still be up and running because all of our servers have redundant power supplies on separate infeed circuits.

theptip 3830 days ago

The rest of the sentence is pertinent:

"...but we can take steps to ensure recovery occurs in a fast and reliable manner. We can also take steps to mitigate the negative impact of these events on our users."

The lessons that giants like Netflix have learned about running massive distributed applications show that you cannot avoid failure, and instead must plan for it.

Now, having a single datacenter is not a good plan if you want to give any sort of uptime guarantee, but that's a different point to make.

tonylxc 3830 days ago

My point is: they shouldn't ONLY plan on ensuring recovery occurs fast; they should also plan on having multiple data centers, which to me is more important. It's frightening to know that such an important service is only operating in a single data center.

However, their recovery report didn't mention anything about such a plan.

<< Edited: correct a grammar error.

theptip 3830 days ago

I completely agree that geo-redundancy is a hard requirement for a site as critical to the functioning of the internet as GitHub.

A generous reading of "We can also take steps to mitigate the negative impact of these events on our users." would include improvements of that sort.

That said, I also didn't spot any concrete proposals for geo-redundancy in the post-mortem. Perhaps that's a detail that will be figured out in a following exercise, or perhaps they really don't have any plans for GR, in which case the generous reading would be unwarranted.

fapestniegd 3830 days ago

You're not going to fail over to a secondary datacenter in under 125 minutes. An RTO under that costs a prohibitively stupid amount.

theptip 3830 days ago

Why not? There's no reason in principle that you can't have hot standbys that are switched over to immediately when the primary fails. Or even a no-primary setup with each cluster being master for some data and slave for others (a la Cassandra's replication model).

There may be specific aspects of GitHub's use case that make this difficult, but please don't pretend that geo-redundancy is impossible. Look at Netflix's architecture for an example of a site that services traffic from multiple AZs.
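As a toy illustration of the hot-standby idea, the switch-over decision itself can be very small. This is only a sketch under assumptions: `Site` and `pick_active` are hypothetical names, health is a plain flag rather than a real health check, and it says nothing about the hard parts (data replication, split-brain, DNS or routing changes).

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    healthy: bool = True  # stand-in for a real health check result


def pick_active(sites, current):
    """Keep the current site while it is healthy; otherwise promote
    the first healthy site in the list (the hot standby)."""
    if current.healthy:
        return current
    for site in sites:
        if site.healthy:
            return site
    raise RuntimeError("no healthy site available")


sites = [Site("primary"), Site("standby")]
active = sites[0]
active.healthy = False               # simulate the primary losing power
active = pick_active(sites, active)
print(active.name)
```

The decision is cheap; the cost fapestniegd is pointing at lives in keeping the standby's data current enough that promoting it is safe.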
