| I'm not sure what part of servers failing to POST is especially complex or related to distributed computing. For all the fawning over being given technical details, the article was pretty light on them. I don't think GitHub going down for a couple of hours is that big of a deal, TBH. But it does seem to expose a few really basic failings in their DR planning, IMO. I also think it's ridiculous that some commenters are trying to frame this as a distributed computing problem. It's not even a clustering problem (apparently). It's just a matter of looking at the iDRAC or whatever to see why the server isn't getting past POST, and putting your recovery plan into action. This is white-box, vanilla stuff that happens to everybody. That servers had to be rebuilt as part of DR says a lot. The fact that there was a Redis dependency during bootstrap? Probably a good thing. As you know as well as anyone, I'm sure, the last thing you want is a bunch of processes that only look like they're up. And even if they could start without erroring on their missing Redis connections, if Redis is used for caching, what's that going to do to availability? Would it be a good thing to have the processes up if they can only handle 10% of the usual load? Those are details that aren't there. But a complex distributed computing problem this is not. Not as it was presented, anyway. |