by JPKab 4774 days ago
It's noble of you to come clean and own your mistake, but let me say this over and over:

You should never, ever provide an environment that stores people's hard work without having professionals who know how to safeguard it.

If it makes you feel any better, I recently had to clean up a mess in a huge enterprise IT shop (if I were to name the organization, you would immediately know it) involving hundreds of thousands of man-hours of work lost due to a lazy, incompetent DBA and the clueless management above her.

This "DBA" was the kind of person who came in at 9:45 AM, took a two-hour lunch at noon, and left at 3:30. Did I mention she refused a work-from-home option?

She didn't know how to set up cron jobs, so all of her backup scripts had to be run manually. If she was on vacation, they didn't get run. Surprise, surprise: the DB died after her long pre-Christmas vacation. Zero backups for the first 3 weeks of December.

Even "professionals" can be suspect sometimes.
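A failure like the three weeks of missing backups above is the kind of thing a small freshness check could flag. Here is a minimal Python sketch, assuming backups land as files in a single directory; the `*.dump` pattern, the 26-hour threshold, and both function names are hypothetical choices for illustration, not anything from that shop.

```python
import time
from pathlib import Path
from typing import Optional


def newest_backup_age_seconds(backup_dir: str, pattern: str = "*.dump",
                              now: Optional[float] = None) -> Optional[float]:
    """Return the age in seconds of the newest matching file, or None if none exist."""
    now = time.time() if now is None else now
    mtimes = [p.stat().st_mtime for p in Path(backup_dir).glob(pattern) if p.is_file()]
    if not mtimes:
        return None
    return now - max(mtimes)


def backups_are_stale(backup_dir: str, max_age_seconds: float = 26 * 3600,
                      pattern: str = "*.dump", now: Optional[float] = None) -> bool:
    """True if there is no backup at all, or the newest one is older than the threshold."""
    age = newest_backup_age_seconds(backup_dir, pattern, now)
    return age is None or age > max_age_seconds
```

Run from a scheduler on a different machine than the one taking the backups, something like this would complain after one missed night rather than after the database dies. It only checks that a file exists and is recent, not that it can be restored.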

3 comments

Running cronjob backups and looking at them in passing to see that they look like valid backups is not sufficient for any serious website or web service.

Automated backups need automated backup restoration and testing. Otherwise, the backups might not be created properly, or they might be perfect backups that have some hidden error that will cause them to fail when they're put to use.

As an example, see Jeremiah Wilton's case study of his own experience with Amazon's Oracle database problem in 1997: http://www.bluegecko.net/download/disaster-diary.pdf

Other than the one missed backup, backup procedures were fine. An Oracle bug caused the database to refuse to start because of a database format/schema change made weeks earlier. TESTING backups would have caught the error and allowed them to fix it before they took down their production database and triggered the bug on the next attempt to start it.
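To make "automated backup restoration and testing" concrete, here is a minimal Python sketch. It uses SQLite purely as a stand-in so the example is self-contained (the incident above involved Oracle, where the tooling is entirely different). The function names and the `expected_min_rows` sanity check are assumptions for illustration.

```python
import sqlite3
import tempfile
from contextlib import closing
from pathlib import Path


def make_backup(source_path: str, backup_path: str) -> None:
    """Copy a live SQLite database to backup_path using the online backup API."""
    with closing(sqlite3.connect(source_path)) as src:
        with closing(sqlite3.connect(backup_path)) as dst:
            src.backup(dst)


def verify_backup(backup_path: str, expected_min_rows: dict[str, int]) -> list[str]:
    """Restore the backup into a scratch database and return any problems found."""
    problems = []
    with tempfile.TemporaryDirectory() as scratch:
        restored_path = str(Path(scratch) / "restored.db")
        try:
            with closing(sqlite3.connect(backup_path)) as backup:
                with closing(sqlite3.connect(restored_path)) as restored:
                    backup.backup(restored)  # the actual "restore" step
                    status = restored.execute("PRAGMA integrity_check").fetchone()[0]
                    if status != "ok":
                        problems.append(f"integrity_check reported: {status}")
                    # Table names come from our own config, not from user input.
                    for table, minimum in expected_min_rows.items():
                        count = restored.execute(
                            f'SELECT COUNT(*) FROM "{table}"'
                        ).fetchone()[0]
                        if count < minimum:
                            problems.append(
                                f"{table}: {count} rows, expected at least {minimum}"
                            )
        except sqlite3.Error as exc:
            problems.append(f"restore failed: {exc}")
    return problems
```

The point is that the check starts the restored copy and queries it, instead of only confirming that a backup file exists. An empty list means the restore worked and the data looks plausible; anything else should page someone.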

> You should never, ever provide an environment that stores people's hard work without having professionals who know how to safeguard it.

Even if they do know how to safeguard the data, that doesn't mean that everything else is going to work properly.

I had recently taken over IT after working for six years as a developer. In fact, this happened only a month or so into my new role.

Our mail server died. Three out of four drives in the hardware RAID 10 failed. I'd been seeing bounces to root@localhost from root@localhost in the nightly reports, but the way things were configured made it nearly impossible to figure out where the mails were coming from. Thanks, Zimbra. We speculate that these were constant alerts from our RAID card notifying us of the impending disaster.

Oh, and the only backups for the mail store were on the machine itself, and in the local Thunderbird installs that half the company used instead of the Zimbra web interface. The machine was in a colo downtown, not local, and running backups over our pathetic little DSL connection was unmanageable.

Both of these things were known problems, both marked high priority, but both months away from being addressed when things went south.

This happened on a Friday. By Monday morning, I'd moved us over to a hosted service and manually sorted all of the mail that hit a catch-all mailbox on a VM I'd set up. By Tuesday, I'd audited every one of our other machines to make sure that mail to root was deliverable (it wasn't on about a dozen machines) and that every machine with hardware RAID had both local and remote monitoring.

Some people, including Directors and C-levels, lost up to ten years of mail. It was the worst IT disaster the company ever faced. But that's not the worst part. No, the worst part is that we're in the IT industry, and knew the entire time that what we were doing was wrong... fixing it had just never been prioritized before, because it wasn't seen as super urgent that it be fixed.

That lesson has been learned.

Sounds like they need a technical founder. Stock well spent imho...