Hacker News new | ask | show | jobs
by simula67 1020 days ago
> While manual backup and restore tests were run once a month to ensure our backups were functioning, they were run manually. After digging into why our restores were not coming up with data, I found that our recurring backups were missing the flag to run volume backups with Restic which snapshots PVC block volume data.

Can someone explain this? How did they test restores, if the actual restore failed to come up with data?

3 comments

When I started at a large company a few years back, the company specified restoration test was just that the archive restored successfully onto a server, not that there was anything actually in the archive. Digging into it, the archives were all empty due to a commit a few years previous that added an incorrect exclude option that ended up excluding all files.

They were running for years on the cusp of total failure and had automated restoration tests that caused a false sense of security in the tooling.

The second thing I did was adjust the restoration tooling to validate data existence and over time added validation tests (percent of data matched current live systems, specific fields and values were there, etc).

It's just too easy to screw up, doubly so when time constrained and alone doing the best you can without any oversight.

That’s one of the reasons I started tracking the filezise of the archive, and monitoring that over time. Datadog will trigger an alert if our backups are suddenly X% smaller.
Empty backups or exponentially I creasing in size backups of backups. The yin and yang of backup bugs.
During their tests, they were creating new backups and testing a restore with them. They're weren't actually testing their real backups.
Velero can be configured to run on a schedule, but the scheduled command was apparently not the exact same command they were using to perform the manual tests - the scheduled job was missing part of the command, basically.
Sp they were manually doing a backup and then testing that backup, rather than testing their automated backups? If thats what they were doing that just makes me wonder... why?
It's possible that the backup restore/test process was still 'automated' to some degree, like a manually-run pipeline or playbook etc.

Obviously there were mistakes made in process design and tools deployment but I don't think this is 'baffling incompetence' more than it is a couple of small mistakes that compounded each other to have a pretty spectacular impact.

I am not sure about the pricing structure of their provider, but maybe it is a Hotel California situation? Data egress price of the provider was such that they did not want to pay to retrieve the full backup "just for a test"? Instead, repeat the backup command, but swap it out for a local location.