
> You should never, ever provide an environment that stores people's hard work without having professionals who know how to safeguard it.

Even if they do know how to safeguard the data, that doesn't mean that everything else is going to work properly.

I had recently taken over IT after working for six years as a developer. In fact, this happened only a month or so into my new role.

Our mail server died. Three out of four drives in the hardware RAID 10 failed. I'd been seeing bounces to root@localhost from root@localhost in the nightly reports, but the way things were configured made it nearly impossible to figure out where the mails were coming from. Thanks, Zimbra. We speculate that these were constant alerts from our RAID card notifying us of the impending disaster.

Oh, and the only backups for the mail store were on the machine itself, and in the local Thunderbird installs that half the company used instead of the Zimbra web interface. The machine was in a colo downtown, not local, and running backups over our pathetic little DSL connection was unmanageable.

Both of these things were known problems, both marked high priority, but both months away from being addressed when things went south.

This happened on a Friday. By Monday morning, I'd moved us over to a hosted service and manually sorted all of the mail that had hit a catch-all mailbox on a VM I'd set up. By Tuesday, I'd audited every one of our other machines to make sure that mail to root was deliverable (it wasn't on about a dozen machines) and that every machine with hardware RAID had both local and remote monitoring.
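For anyone wanting to run a similar audit, here is a minimal sketch of the "is mail to root going anywhere useful" check. It is not the script used at the time. It assumes a sendmail-style aliases file (commonly `/etc/aliases`) with one `name: target, target` entry per line and ignores continuation lines, and the function names `root_alias_targets` and `root_mail_leaves_host` are made up for illustration.

```python
def root_alias_targets(aliases_text):
    """Return the targets listed for the 'root' alias, or [] if none is defined."""
    for raw in aliases_text.splitlines():
        line = raw.strip()
        # Skip blank lines, comment lines, and anything that is not "name: targets".
        if not line or line.startswith("#") or ":" not in line:
            continue
        name, targets = line.split(":", 1)
        if name.strip().lower() == "root":
            return [t.strip() for t in targets.split(",") if t.strip()]
    return []


def root_mail_leaves_host(aliases_text):
    """True if at least one root target is an address on a non-localhost domain."""
    for target in root_alias_targets(aliases_text):
        if "@" not in target:
            continue  # a bare local user name keeps the mail on this machine
        domain = target.rsplit("@", 1)[1].lower()
        if domain not in ("localhost", "localhost.localdomain"):
            return True
    return False


if __name__ == "__main__":
    # Assumption: the aliases file lives at /etc/aliases on this host.
    with open("/etc/aliases") as f:
        ok = root_mail_leaves_host(f.read())
    print("root mail is forwarded off-host" if ok else "WARNING: root mail stays local")
```

This only inspects configuration. It says nothing about whether the mail is actually delivered, so sending a test message to root and confirming that it arrives is still the real check.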

Some people, including Directors and C-levels, lost up to ten years of mail. It was the worst IT disaster the company ever faced. But that's not the worst part. No, the worst part is that we're in the IT industry, and knew the entire time that what we were doing was wrong... fixing it had simply never been prioritized, because it wasn't seen as urgent.

That lesson has been learned.