by tonylxc 3795 days ago
My point is: they shouldn't ONLY plan on ensuring recovery occurs fast; they should also plan on having multiple data centers, which to me is more important. It's frightening to know that such an important service is only operating in a single data center.

However, their recovery report didn't mention anything about such a plan.

<< Edited: correct a grammar error.

2 comments

I completely agree that geo-redundancy is a hard requirement for a site as critical to the functioning of the internet as GitHub.

A generous reading of "We can also take steps to mitigate the negative impact of these events on our users." would include improvements of that sort.

That said, I also didn't spot any concrete proposals for geo-redundancy in the post-mortem. Perhaps that's a detail that will be figured out in a follow-up exercise, or perhaps they really don't have any plans for GR, in which case the generous reading would be unwarranted.

You're not going to fail over to a secondary datacenter in under 125 minutes. An RTO under that costs a prohibitively stupid amount.

Why not? There's no reason in principle that you can't have hot standbys that are switched over to immediately when the primary fails. Or even a no-primary setup with each cluster being master for some data and slave for others (a la Cassandra's replication model).

There may be specific aspects of GitHub's use case that make this difficult, but please don't pretend that geo-redundancy is impossible. Look at Netflix's architecture for an example of a site that services traffic from multiple AZs.