Hacker News new | ask | show | jobs
by upon_drumhead 1027 days ago
When I started at a large company a few years back, the company specified restoration test was just that the archive restored successfully onto a server, not that there was anything actually in the archive. Digging into it, the archives were all empty due to a commit a few years previous that added an incorrect exclude option that ended up excluding all files.

They were running for years on the cusp of total failure and had automated restoration tests that caused a false sense of security in the tooling.

The second thing I did was adjust the restoration tooling to validate data existence and over time added validation tests (percent of data matched current live systems, specific fields and values were there, etc).

It's just too easy to screw up, doubly so when time constrained and alone doing the best you can without any oversight.

2 comments

That’s one of the reasons I started tracking the filezise of the archive, and monitoring that over time. Datadog will trigger an alert if our backups are suddenly X% smaller.
Empty backups or exponentially I creasing in size backups of backups. The yin and yang of backup bugs.