by rwultsch 2658 days ago
Short duration: network, bad software deploy. Long duration: db. If you break data, it takes a while to unbreak.

Source: Me. My career has been spent managing DBs for internet-scale sites.

2 comments

I work for a smaller but comparably large platform. "If everything is down check the DB" is at the top of one of our internal monitoring websites in red.

Screw-ups related to data loss are rare (I've been here for years and haven't seen one in the DBs that the stuff I work with uses), but failures at this scale tend to cascade a little ways, and it takes time to dig out of the hole. They have probably solved the problem already, but they have to spend a bunch of time synchronizing things and verifying the fix before they press the big red "go live" button.

Shouldn't the monitoring websites be able to check the DB status for you before you even look at that red text? :)
We have a different dedicated page that gives an overview of what's going on with the DB. The page in question is supposed to be a single stop that lets you visually get an overview of the state of the application servers, see whether things are "normal" and, if not, quickly identify what is not normal.
Nothing worse than that sinking feeling of "oh fuck, we have to backfill a lot of data."
Why did my username on this site just change to 'test123'... oh, where clauses.
Nothing worse than the page on Friday night. Oh, there goes my weekend.