Hacker News
by fabled_giraffe 3656 days ago
You'd be surprised how tenuous backup situations are at most companies and organizations. Even if your company is doing full disaster recovery checks 24/7, one after the other, a single "well-placed" failure or mistake, or a small series of them, can lead to data loss.

Many of the things that cause data loss are just simple mistakes caused by a failure to review changes carefully, even when you have multiple levels of review. I would dare to say especially when you have high confidence that someone else is reviewing your changes.

"Manual" data changes, e.g. executing SQL statements or running scripts that execute SQL outside of your application, are in my experience the most common cause of data loss.
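
One way to take some of the risk out of a manual change is to state up front how many rows you expect to touch and roll back if the database disagrees. This is only a minimal sketch, assuming Python's built-in sqlite3 module; the `guarded_update` helper and the `users` table are hypothetical names made up for illustration, not anything from a real system.

```python
import sqlite3

def guarded_update(conn, sql, params, expected_rows):
    """Run one manual data change; commit only if it touched the expected number of rows."""
    cur = conn.cursor()
    try:
        cur.execute(sql, params)  # sqlite3 opens a transaction implicitly before an UPDATE
        if cur.rowcount != expected_rows:
            conn.rollback()       # unexpected blast radius: undo the change
            return False
        conn.commit()
        return True
    except sqlite3.Error:
        conn.rollback()
        raise

# Hypothetical demo data: three users, all active.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, active INTEGER)")
conn.executemany("INSERT INTO users (id, active) VALUES (?, 1)", [(1,), (2,), (3,)])
conn.commit()

# Forgotten WHERE clause: touches 3 rows where 1 was expected, so it is rolled back.
blocked = guarded_update(conn, "UPDATE users SET active = 0", (), expected_rows=1)
# Intended change: touches exactly 1 row, so it is committed.
applied = guarded_update(conn, "UPDATE users SET active = 0 WHERE id = ?", (2,), expected_rows=1)
```

A check like this does not replace review or backups. It only catches the case where a statement hits far more rows than you meant it to.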

After manual data changes, the second most common cause in my experience is not understanding what you are doing. For example, because you have a deadline to meet, you might take a chance on an upgrade that then fails.

Changes to application code are next. Typically when making changes to an application, a little more thought may be put into it than into a one-off data migration or change, but if you are under time pressure, don't know what you are doing, or are assuming someone else will catch your mistakes, you could easily screw everything up.

Following this: mistakes that cause hardware or software failure. I worked at one large organization where storage arrays with various power backups were just "turned off" by a contractor who didn't understand the impact of what he or she was doing.

Finally, you might have configuration issues or the hardware might just fail.

Really, there is no substitute for having your data backed up frequently, in a way you know how to restore easily and have tested by building another machine from the ground up to replace the original, and for documenting and practicing that process well. Very few do this frequently. And even if you do, what if all of your hardware were destroyed? Could you easily go out and buy something off the shelf and rebuild everything, using instructions you have in your head or stored safely around the world?
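
As a toy illustration of "tested" (nowhere near a full rebuild of a machine), the sketch below restores an archive into a scratch directory and compares SHA-256 digests of every file against the source. It assumes Python's standard library only; the function names `make_backup`, `tree_digests`, and `restore_matches_source` are hypothetical, and the archive is assumed to live outside the directory being backed up.

```python
import hashlib
import tempfile
import zipfile
from pathlib import Path

def tree_digests(root):
    """Map each file's path (relative to root) to the SHA-256 of its contents."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in root.rglob("*")
        if p.is_file()
    }

def make_backup(src, archive):
    """Write every file under src into a zip archive (archive must be outside src)."""
    src = Path(src)
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for p in src.rglob("*"):
            if p.is_file():
                zf.write(p, p.relative_to(src).as_posix())

def restore_matches_source(src, archive):
    """Restore the archive into a scratch directory and compare it to src file by file."""
    with tempfile.TemporaryDirectory() as scratch:
        with zipfile.ZipFile(archive) as zf:
            zf.extractall(scratch)
        return tree_digests(scratch) == tree_digests(src)
```

Calling `restore_matches_source(src, archive)` right after `make_backup(src, archive)` should return True. If it returns False later, the backup is stale or damaged relative to the source. The point is that the restore path gets exercised, not just the backup path.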

2 comments

Please note the severity of this. It's not "We had a crash and lost a week's worth of data" - yeah, that happens. And it's totally acceptable if you're unable to restore it, for any of the reasons you mention.

This is "We had a crash and lost over a decade's worth of data" - yeah, that shouldn't happen. Ever. If you don't have a working backup from sometime in the past 13 years, -you are doing something wrong-. That's not a small series of failures or mistakes. That's either staggering levels of incompetence, or maliciousness.

This is exactly right. A misconfigured backup script, or a pessimal chain of crashes, or a lot of other things could lose recent data. I wouldn't be shocked by data loss.

This is a decade of data. It should have been in cold storage, on tested media, in multiple locations. No single error, or small chain of errors, should have enabled this to happen.

We are not talking about "most companies". We are talking about the Air Force, an organisation that is built to be resilient and handle crises. Backups, fallbacks, redundancy, checks, reviews, access management and risk evaluation are BAKED INTO their culture and mission.

And we are not talking about "any data". This was clearly very sensitive data they knew they needed to protect.

Either the Air Force is failing at being the very thing it was created to be (which I doubt) or something is fishy (Occam's razor).