| HN Mirror



	by earless1 2659 days ago
	What manner of failure would cause such globally deployed and distributed systems to go down like this? I'm very interested to read up on this when they release details of the failure.

5 comments

rwultsch 2659 days ago

Short duration: network, bad software deploy. Long duration: db. If you break data, it takes a while to unbreak.

Source: Me. My career has been spent managing db's for internet scale sites.


dsfyu404ed 2659 days ago

I work for a smaller but comparably large platform. "If everything is down check the DB" is at the top of one of our internal monitoring websites in red.

Screw-ups related to data loss are rare (I've been here for years and haven't seen one in the DBs that the stuff I work with uses), but failures at this scale tend to cascade a little ways, and it takes time to dig out of the hole. They probably have the problem solved, but they have to spend a bunch of time synchronizing things and verifying the fix before they press the big red "go live" button.


pferde 2658 days ago

Shouldn't the monitoring websites be able to check the DB status for you before you even look at that red text? :)


dsfyu404ed 2658 days ago

We have a different dedicated page that gives an overview of what's going on with the DB. The page in question is supposed to be a single stop that lets you visually get an overview of the state of the application servers and whether things are "normal" and, if not, lets you quickly identify what is not normal.


cheeze 2659 days ago

Nothing worse than that sinking feeling of "oh fuck, we have to backfill a lot of data."


WrtCdEvrydy 2659 days ago

Why did my username on this site just change to 'test123'... oh, where clauses.
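For anyone who hasn't been bitten by this yet, here is a minimal sketch of the missing-WHERE failure and one possible guard against it. It uses an in-memory SQLite database; the `users` table, the `rename_user` helper, and the `max_rows` limit are all made up for illustration and are not taken from any real site's schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)",
                 [("alice",), ("bob",), ("carol",)])
conn.commit()


def names(conn):
    """Return all usernames ordered by id."""
    return [row[0] for row in conn.execute("SELECT name FROM users ORDER BY id")]


# The bug: no WHERE clause, so the UPDATE rewrites every row in the table.
rows_hit_by_bug = conn.execute("UPDATE users SET name = 'test123'").rowcount
names_after_bug = names(conn)
conn.rollback()  # the UPDATE was never committed, so this undoes it


def rename_user(conn, user_id, new_name, max_rows=1):
    """Rename one user; roll back if the statement touched too many rows."""
    cur = conn.execute("UPDATE users SET name = ? WHERE id = ?",
                       (new_name, user_id))
    if cur.rowcount > max_rows:
        conn.rollback()
        raise RuntimeError(f"UPDATE touched {cur.rowcount} rows, expected <= {max_rows}")
    conn.commit()
    return cur.rowcount


rows_hit_by_fix = rename_user(conn, 2, "test123")
```

Checking the affected row count before committing is only one option. The rollback here works because the bad statement was still inside an open transaction, which is exactly what you don't have once it has been committed in production.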


jomkr 2658 days ago

Nothing worse than the page on Friday night. Oh, there goes my weekend.


fixermark 2659 days ago

I have no inside knowledge of this one, but broadly speaking, these sorts of failures can be caused by a change to the core software that was thought innocent at the time and is then widely deployed using automated systems. If the core's tests didn't catch a real issue in production (and, for whatever reason, the rollout happens faster than the regular small-release verification process can catch the error), things can go sour in a way that's expensive to un-sour.

Amazon once pushed a seemingly-innocuous change to their internal DNS that caused all the routers between and within datacenters to drop their IP tables on the floor. They had to re-establish the entire network by hand: datacenter heads calling each other up and reading IP address ranges over the phone to be hand-entered into lookup tables. Cost a fortune in lost sales for the time the whole site was inaccessible.
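The "rollout happens faster than verification" failure mode is easier to see in code. Below is a rough sketch of a staged rollout that widens only while everything deployed so far still looks healthy. The `staged_rollout` function, the `deploy` and `is_healthy` callbacks, and the stage fractions are all hypothetical; this is not how any particular company does it.

```python
def staged_rollout(hosts, deploy, is_healthy, fractions=(0.01, 0.10, 0.50, 1.0)):
    """Deploy to a growing share of hosts, halting at the first unhealthy stage.

    Returns (deployed_hosts, completed). A real system would also wait for a
    bake period between stages; that delay is left out of this sketch.
    """
    deployed = []
    for fraction in fractions:
        # Cumulative number of hosts that should be updated after this stage.
        target = max(1, int(len(hosts) * fraction))
        for host in hosts[len(deployed):target]:
            deploy(host)
            deployed.append(host)
        # Verify before widening; skipping this step is what lets a bad
        # change reach every host.
        if not all(is_healthy(h) for h in deployed):
            return deployed, False
    return deployed, True


hosts = [f"h{i}" for i in range(100)]

# A change that is broken everywhere is caught after a single host.
bad_everywhere = staged_rollout(hosts, deploy=lambda h: None,
                                is_healthy=lambda h: False)

# A change that only breaks host h5 is caught at the 10% stage.
bad_on_h5 = staged_rollout(hosts, deploy=lambda h: None,
                           is_healthy=lambda h: h != "h5")

# A good change reaches all hosts.
good = staged_rollout(hosts, deploy=lambda h: None, is_healthy=lambda h: True)
```

The trade-off is speed: the smaller the early stages and the longer the checks, the slower a good change ships, which is presumably why rollouts sometimes outrun their verification.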


str33t_punk 2659 days ago

As someone who works at a large company in the networking space, I can say you would be surprised how minor changes to configuration can cause catastrophic failures that are really challenging to come back from.

Network failures are usually really bad when your system is globally deployed and distributed; oftentimes you can't even communicate with your machines to deliver fixes :p


phoe-krk 2659 days ago

An expired certificate, for instance.

https://www.thesslstore.com/blog/expired-certificate-ericsso...
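Expiry is at least cheap to watch for. Here is a small sketch using only the Python standard library; the helper names and the 30-day warning threshold are my own assumptions, not anything from the linked article.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days left before a certificate's notAfter time (negative if expired).

    `not_after` uses the format returned by getpeercert(),
    e.g. "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires_at - now) / 86400


def needs_renewal(not_after, warn_days=30, now=None):
    """True when the certificate expires within `warn_days` days."""
    return days_until_expiry(not_after, now) < warn_days


def fetch_not_after(host, port=443, timeout=5.0):
    """Fetch the notAfter field of a server's certificate over TLS.

    With the default verifying context, an already-expired certificate
    makes the handshake itself raise ssl.SSLCertVerificationError.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Something like `needs_renewal(fetch_not_after("example.com"))` could run from a cron job. It only covers certificates reachable over TLS from wherever the check runs, so certificates embedded in software or devices would need a different check.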


jankassens 2659 days ago

Here's one example: https://rachelbythebay.com/w/2019/01/20/quiet/
