by mjevans 3787 days ago
This just shows how difficult it is to avoid hidden dependencies without a complete, cleanly isolated testing environment, one large enough to replicate production operations and to run strange system fault scenarios somewhere that won't kill production.
2 comments

It turns out that it's even hard then. Complex systems, by their very nature, fail in unexpected and unpredictable ways. If that weren't bad enough, hindsight bias makes it way too easy for us to look back with perfect knowledge and opine "That was so obvious, how could they have missed such a rudimentary issue?"

If only things were that easy.

I'm not sure what part of servers failing to POST is especially complex or related to distributed computing.

For all the fawning over being provided technical details, this article was pretty light on them.

I don't think GitHub going down for a couple of hours is that big of a deal TBH. But it does seem to expose a few really basic failings in their DR planning IMO.

I also think it's ridiculous that some commenters are trying to frame this as a distributed computing problem. It's not even a clustering problem (apparently). It's just looking at the iDRAC or whatever to see why the server isn't getting past POST and putting your recovery plan into action.

This is white box vanilla stuff that happens to everybody.

That servers had to be rebuilt as part of DR says a lot.

The fact that there was a Redis dependency during bootstrap? Probably a good thing. You know as well as anyone, I'm sure, that the last thing you want is a bunch of processes that only look like they're up. And even if they could start without erroring when their Redis connections are missing, if Redis is used for caching, what's that going to do to availability? Would it be a good thing to have the processes up if they can only handle 10% of the usual load?
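
For what it's worth, the "refuse to come up rather than only look up" idea can be sketched roughly like this. It is only an illustration and not how GitHub's bootstrap works: `DependencyUnavailable`, `tcp_probe` and `require_dependencies` are names made up for this example, and a TCP connect is a much weaker check than a real Redis health check.

```python
import socket
from typing import Callable, Dict


class DependencyUnavailable(RuntimeError):
    """Raised at startup when a required dependency cannot be reached."""


def tcp_probe(host: str, port: int, timeout: float = 1.0) -> Callable[[], bool]:
    """Return a probe that reports whether a TCP connection can be opened."""
    def probe() -> bool:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            # Refused, timed out, unresolvable host: all count as unreachable.
            return False
    return probe


def require_dependencies(probes: Dict[str, Callable[[], bool]]) -> None:
    """Fail fast: raise if any required dependency is unreachable."""
    failed = sorted(name for name, probe in probes.items() if not probe())
    if failed:
        raise DependencyUnavailable("unreachable at startup: " + ", ".join(failed))
```

A process would call something like `require_dependencies({"redis": tcp_probe("localhost", 6379)})` before accepting traffic (the host here is hypothetical; 6379 is just Redis's default port). Whether failing hard is the right call still depends on the question above: how much load the process could really serve without the cache.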

Those are details that aren't there.

But complex distributed computing problem this is not. Not as it was presented anyways.

Or use the Netflix model: Chaos testing in production.
No system is perfect; as you continue to add 9s, the cost increases steeply.

Usually it's just cheaper to be down for an hour or two than to architect for the end of times.
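
To put rough numbers on the nines, here is a back-of-the-envelope sketch, assuming a 365-day year. It says nothing about cost, only about how fast the downtime budget shrinks: each extra nine cuts it by a factor of ten, so three nines allows about 525.6 minutes a year and five nines only about 5.3. The function name is made up for this example.

```python
def allowed_downtime_minutes(nines: int, days_per_year: float = 365.0) -> float:
    """Minutes of downtime per year permitted at an availability of N nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 3 nines -> 0.001 (99.9% available)
    return unavailability * days_per_year * 24 * 60


for n in range(1, 6):
    print(f"{n} nines: {allowed_downtime_minutes(n):.1f} minutes/year")
```

An "hour or two" of downtime in a year sits somewhere between three and four nines on this scale.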

> Usually it's just cheaper to be down for an hour or two than to architect for the end of times

The opposite of this philosophy was the motivation behind the creation of the internet in the first place.

This seems precisely wrong. Some reading:

http://web.mit.edu/Saltzer/www/publications/endtoend/endtoen...

https://www.jwz.org/doc/worse-is-better.html

[thanks for the hint 'thinkpad20! I don't know what I was thinking.]

Just a note: if you don't indent your links they'll be made clickable by the markup engine, which is convenient in general and especially for those of us on smart phones. :)
I think he was referring to ARPANET being a military project whose goal was a system that could survive a nuclear attack or other such calamity.
It is not precisely wrong, and thanks for tricking me into opening an obscene picture at work, asshole.

The internet is designed to be highly fault tolerant, because it was based on an ARPANET project to design a network that would NOT go down, even if a significant percentage of nodes were damaged.

The "asshole" in this case is JWZ, [randomly?] switching on the Referer header. Apparently he has a hard-on for HN; he's not the only one, but I won't be linking to his site again. (Although, is that really "obscene"? It doesn't do anything for me?) Try this instead, since Stanford are unlikely to engage in such shenanigans:

https://web.stanford.edu/class/cs240/old/sp2014/readings/wor...

It's funny, my original comment had the links in plaintext so copying-and-pasting was required and Referer wasn't involved. I changed that on request. b^)

And yet, we have services that still don't spend the money on geographically dispersed datacenters.
Part of our Chaos testing in prod is exercising our ability to route traffic around failures of entire regions. jobs.netflix.com
Or Google for that matter. DiRT.