by firecraker 1073 days ago
> 2017/01/31 23:00-ish YP thinks that perhaps pg_basebackup is being super pedantic about there being an empty data directory, decides to remove the directory. After a second or two he notices he ran it on db1.cluster.gitlab.com, instead of db2.cluster.gitlab.com

> 2017/01/31 23:27 YP - terminates the removal, but it’s too late. Of around 310 GB only about 4.5 GB is left

I can't even imagine the sinking feeling...
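
One cheap guard against the db1/db2 mix-up is to have a script refuse to run unless it is on the host the operator named. A minimal sketch, assuming the destructive step is wrapped in a Python script; `assert_expected_host` is a made-up helper name, not anything from GitLab's tooling:

```python
import socket


def assert_expected_host(expected, actual=None):
    """Raise unless we are on the host the operator meant to touch.

    `actual` is injectable for testing; by default it is this machine's
    hostname as reported by socket.gethostname().
    """
    actual = socket.gethostname() if actual is None else actual
    if actual != expected:
        raise RuntimeError(
            f"refusing to continue: running on {actual!r}, expected {expected!r}"
        )
    return actual
```

Calling `assert_expected_host("db2.cluster.gitlab.com")` at the top of the script would abort before anything is deleted on the wrong machine. It doesn't help with commands typed by hand, which is what happened here.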


> YP says it’s best for him not to run anything with sudo any more today, handing off the restoring to JN.

Then in the post-mortem about lack of backups:

> LVM snapshots are by default only taken once every 24 hours. YP happened to run one manually about 6 hours prior to the outage

> Regular backups seem to also only be taken once per 24 hours, though YP has not yet been able to figure out where they are stored. According to JN these don’t appear to be working, producing files only a few bytes in size.

I have had (and inevitably will have again) bad days like poor YP. All I can count on is to maintain good habits, like making backups before undergoing production work like YP did.

> like making backups before undergoing production work

The specific part you mention also brings up a really vital part of a backup system: testing that the backups generated can actually be restored.

I've seen so many companies with untested recovery procedures. Most of the time they just say something like "Of course the built-in backup mechanism works, if it didn't, it wouldn't be much of a backup, would it? Haha" while never having actually tried to recover from it.

Although, to be fair, out of the tens of untested setups I've seen, only once did the backups actually fail when they were needed. The morale hit that company took was enough to burn "test your backups" into my brain.
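
Even without a full restore drill, a scheduled sanity check would have flagged backups that are "only a few bytes in size". A rough sketch, assuming backups land as single files on disk; `backup_problems` and its thresholds are made up for illustration and would need tuning per setup:

```python
import os
import time


def backup_problems(path, min_bytes=1024, max_age_hours=24.0, now=None):
    """Return a list of human-readable problems with a backup file.

    An empty list means the file passed these checks only; it does NOT
    prove the backup can be restored.
    """
    if not os.path.isfile(path):
        return [f"{path}: missing"]

    problems = []
    st = os.stat(path)

    # A dump of a real database should never be just a few bytes.
    if st.st_size < min_bytes:
        problems.append(f"{path}: too small ({st.st_size} bytes)")

    # A stale file suggests the backup job silently stopped running.
    now = time.time() if now is None else now
    age_hours = (now - st.st_mtime) / 3600
    if age_hours > max_age_hours:
        problems.append(f"{path}: too old ({age_hours:.1f} hours)")

    return problems
```

This only catches the obvious failures. The real test is still restoring the dump into a scratch database on a schedule and checking that the data is there.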

Indeed, the feeling of dread when you do something that causes prod to go down is bad enough. I can't even imagine the feeling when accidentally deleting prod data...