No, it sounds good, because it's realistic and then you can build mitigation strategies.
I was recently involved in an outage that occurred because the same datacenter was hit by lightning three times in a row. Everything was redundant up the wazoo and handled the first two hits just fine, but by the time the power went out for the third time within N minutes, there wasn't enough juice left in some of the batteries!
Now would it be possible to build an automated system that can withstand this? Probably. But would your time & money be better spent worrying about other failure modes? Almost certainly.
If your plan to avoid downtime is to prevent power outages, you're going to have downtime. All their sentence says is they can't prevent power outages. That's fine, because the other 1/nth of your servers are on a different power grid in a different state.
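To put rough numbers on that, here's a back-of-the-envelope sketch. It assumes site outages are independent, which is a big assumption (correlated failures such as a shared provider or a bad deploy break it), and the per-site outage probability in the usage lines is made up for illustration. The function names are mine, not anything from the post-mortem.

```python
def all_sites_down_probability(p_site_down: float, n_sites: int) -> float:
    """Chance that every site is down at the same time.

    Assumes each site fails independently with the same probability,
    which is an idealization, not a measured property of any real setup.
    """
    if not 0.0 <= p_site_down <= 1.0:
        raise ValueError("p_site_down must be between 0 and 1")
    if n_sites < 1:
        raise ValueError("n_sites must be at least 1")
    return p_site_down ** n_sites


def capacity_lost_fraction(n_sites: int, sites_down: int) -> float:
    """Fraction of servers lost when servers are spread evenly over n_sites."""
    if n_sites < 1 or not 0 <= sites_down <= n_sites:
        raise ValueError("need 0 <= sites_down <= n_sites and n_sites >= 1")
    return sites_down / n_sites


# Hypothetical figure: each site is dark 1% of the time.
single = all_sites_down_probability(0.01, 1)
triple = all_sites_down_probability(0.01, 3)
lost = capacity_lost_fraction(4, 1)  # one of four sites loses power
```

With one site, a power outage is a full outage. With several, losing one site costs you 1/n of your capacity, and a total outage needs every site to fail at once.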
Whose datacenter are they in? This is the second time in less than two weeks that they've suffered a power-related issue. My company is in 4 different sites around the world and we've never lost power ever - and, if one circuit did go out, we'd still be up and running because all of our servers have redundant power supplies on separate infeed circuits.
"...but we can take steps to ensure recovery occurs in a fast and reliable manner. We can also take steps to mitigate the negative impact of these events on our users."
The lessons that giants like Netflix have learned about running massive distributed applications show that you cannot avoid failure, and instead must plan for it.
Now, having a single datacenter is not a good plan if you want to give any sort of uptime guarantee, but that's a different point to make.
My point is: they shouldn't ONLY plan on ensuring recovery occurs fast; they should also plan on having multiple data centers, which to me is more important. It's frightening to know that such an important service is only operating in a single data center.
However, their recovery report didn't mention anything about such a plan.
I completely agree that geo-redundancy is a hard requirement for a site as critical to the functioning of the internet as GitHub.
A generous reading of "We can also take steps to mitigate the negative impact of these events on our users." would include improvements of that sort.
That said, I also didn't spot any concrete proposals for geo-redundancy in the post-mortem. Perhaps that's a detail that will be figured out in a follow-up exercise, or perhaps they really don't have any plans for GR, in which case the generous reading would be unwarranted.
Why not? There's no reason in principle that you can't have hot standbys that are switched over to immediately when the primary fails. Or even a no-primary setup with each cluster being master for some data and slave for others (a la Cassandra's replication model).
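A minimal sketch of both ideas, just to make them concrete. Everything here is hypothetical: the `Site` type, the site names, and the health flags are placeholders, and the key placement is a plain hash-modulo scheme, not Cassandra's actual ring. Real failover also has to deal with replication lag and split-brain, which this ignores.

```python
import zlib
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Site:
    name: str
    healthy: bool  # in practice this would come from health checks


def choose_active(sites: Sequence[Site]) -> Optional[Site]:
    """Hot standby: return the first healthy site in priority order.

    The primary goes first in the list; standbys follow.
    """
    for site in sites:
        if site.healthy:
            return site
    return None


def owner_for_key(key: str, sites: Sequence[Site]) -> Optional[Site]:
    """No-primary setup: each site owns a slice of the keys.

    The preferred owner is picked by a stable hash of the key; if that
    site is down, ownership falls through to the next healthy site.
    """
    if not sites:
        return None
    start = zlib.crc32(key.encode("utf-8")) % len(sites)
    for offset in range(len(sites)):
        candidate = sites[(start + offset) % len(sites)]
        if candidate.healthy:
            return candidate
    return None


# Hypothetical sites: the primary has lost power, the standby has not.
sites = [Site("primary", healthy=False), Site("standby", healthy=True)]
active = choose_active(sites)              # the standby takes over
owner = owner_for_key("some/repo", sites)  # only healthy site, so the standby
```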
There may be specific aspects of GitHub's use case that make this difficult, but please don't pretend that geo-redundancy is impossible. Look at Netflix's architecture for an example of a site that services traffic from multiple AZs.