HN Mirror


	by rwultsch 2658 days ago
	Short duration: network or a bad software deploy. Long duration: DB. If you break data, it takes a while to unbreak. Source: me. My career has been spent managing DBs for internet-scale sites.

2 comments

dsfyu404ed 2658 days ago

I work for a smaller but comparably large platform. "If everything is down check the DB" is at the top of one of our internal monitoring websites in red.

Screw-ups related to data loss are rare (I've been here for years and haven't seen one with the DBs that the stuff I work with uses), but failures at this scale tend to cascade a little ways, and it takes time to dig out of the hole. They probably have the problem solved, but they have to spend a bunch of time synchronizing things and verifying the fix before they press the big red "go live" button.

pferde 2657 days ago

Shouldn't the monitoring websites be able to check the DB status for you before you even look at that red text? :)

dsfyu404ed 2657 days ago

We have a different dedicated page that gives an overview of what's going on with the DB. The page in question is supposed to be a single stop that lets you visually get an overview of the state of the application servers, see whether things are "normal," and, if not, quickly identify what is not normal.

cheeze 2658 days ago

Nothing worse than that sinking feeling of "oh fuck, we have to backfill a lot of data."

WrtCdEvrydy 2657 days ago

Why did my username on this site just change to 'test123'... oh, where clauses.

jomkr 2657 days ago

Nothing worse than the page on Friday night. Oh, there goes my weekend.
