by akie 2246 days ago
The contents of my backups are never the same, not from one single day to the other - so hashes would be useless.

You don't need hashes to match between days at all, though. You simply hash the file that was just backed up and the backup copy of it, then compare the two.
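A minimal sketch of that source-versus-copy comparison, assuming both files are reachable as local paths. The function names `sha256_of` and `copy_matches` are made up for illustration, and SHA-256 is just one reasonable choice of hash:

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large backups don't have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_matches(source: Path, backup: Path) -> bool:
    """True when the backup copy is byte-identical to the source file."""
    return sha256_of(source) == sha256_of(backup)
```

This only shows that the copy step was faithful on the day it ran; it says nothing about whether the source itself was already damaged.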
I guess it depends somewhat on what you're backing up and what the anticipated failure modes might be. As an example, if there was a bug in my todo software that deleted a bunch of entries, the hash scheme wouldn't pick that up. You've just successfully backed up corrupted data, and you're not aware of it. SQL dumps would be another good example of this. If one day you do a backup and the backup reports that it has archived significantly fewer rows than yesterday, you know something's up. Maybe a fault lost some data, maybe the archiver is broken, etc.
What you're describing is significantly harder than what's described in the blog post. Not only do you have to validate a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (ie. the file format is valid but a chunk of it is missing).

Most people solve this issue by keeping multiple versions, not by trying to "validate" the backups somehow.

Maybe I'm misinterpreting the output from their backup tool, but isn't that exactly what it's doing?

    Metrics for todoist-fullsync:
    
    name   1 days ago  8 days ago
    ------------------------------
    files  1           1
    bytes  82363       86661
    items  85          87
The "items" line there seems like it's actually parsing the file and counting the number of entries in it? It's also captured in point #2: "Can be intuitively evaluated as plausible or suspicious. If the number of tasks in my Todoist export dropped from dozens to 1, that would be cause for concern."
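For what it's worth, the "plausible or suspicious" comparison between two runs could be sketched roughly like this. The function name `suspicious_drops` and the 25% threshold are assumptions for illustration, not anything the tool is known to do:

```python
def suspicious_drops(previous: dict[str, int], current: dict[str, int],
                     max_drop: float = 0.25) -> dict[str, float]:
    """Return the metrics whose value fell by more than `max_drop` (a fraction)."""
    flagged = {}
    for name, old in previous.items():
        new = current.get(name, 0)  # a metric that vanished counts as zero
        if old > 0:
            drop = (old - new) / old
            if drop > max_drop:
                flagged[name] = drop
    return flagged


# Metrics copied from the output above: 8 days ago vs. 1 day ago.
week_ago = {"files": 1, "bytes": 86661, "items": 87}
yesterday = {"files": 1, "bytes": 82363, "items": 85}
flagged = suspicious_drops(week_ago, yesterday)
```

With the numbers shown, nothing is flagged, since every metric moved by only a few percent. A drop from 87 items to 1 would be.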
Right. Aside from 'files' and 'bytes', the metrics are just the result of running shell commands specified in a config file. In this case it's `jq '.items | length' $PARCEL_PATH`, i.e., parse the file and print the length of the attribute named "items".

Obviously, that won't catch all potential problems in the file, but it's a low-effort way to catch some.
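For anyone without jq handy, roughly the same metric can be computed in a few lines of Python. This is only an illustrative equivalent, not how the tool does it, and `count_items` is a made-up name:

```python
import json
from pathlib import Path


def count_items(export_path: Path, key: str = "items") -> int:
    """Parse a JSON export and return the length of the list stored under `key`."""
    with export_path.open(encoding="utf-8") as handle:
        data = json.load(handle)  # raises json.JSONDecodeError on a truncated file
    return len(data[key])         # raises KeyError if the attribute is missing
```

Letting the exceptions propagate is deliberate here: a file that fails to parse or lacks the attribute is exactly the kind of thing you want the backup run to complain about.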

I keep multiple versions as well, and also use third-party backup software on all these files. These techniques are meant to be part of something analogous to a 'defense in depth' against errors in the backup process, not thorough or foolproof.

> Not only do you have to validate a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (ie. the file format is valid but a chunk of it is missing).

But you don't have to do that perfectly to get value out of it; for example:

- If the .json file parses as json, then at least you probably didn't truncate the download mid-stream.

- If it also contains a particular attribute, then you probably didn't save a structured error response instead of the actual data, or save something from an endpoint that has changed in a radically backward-incompatible way and might no longer be adequate.

- If it also has roughly the number of elements you expect, you probably didn't miss entire pages of the response.
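Those three checks can be stacked in a small function. This is a rough sketch under assumptions: the name `check_json_backup`, the `tolerance` parameter, and the warning messages are all invented for illustration, and "roughly the number you expect" is interpreted as within plus or minus 50% by default:

```python
import json


def check_json_backup(raw: str, required_key: str, expected_count: int,
                      tolerance: float = 0.5) -> list[str]:
    """Return a list of warnings; an empty list means every check passed."""
    # Check 1: does it parse at all? A download cut off mid-stream usually won't.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]

    # Check 2: is the expected attribute there? An error response usually lacks it.
    if not isinstance(data, dict) or required_key not in data:
        return [f"missing attribute {required_key!r}"]
    items = data[required_key]
    if not isinstance(items, list):
        return [f"attribute {required_key!r} is not a list"]

    # Check 3: is the element count in the expected ballpark?
    low = expected_count * (1 - tolerance)
    high = expected_count * (1 + tolerance)
    if not low <= len(items) <= high:
        return [f"{len(items)} elements, expected roughly {expected_count}"]
    return []
```

Each check returns early because the later ones make no sense once an earlier one fails. None of this proves the backup is good; it only rules out a few cheap-to-detect failure modes.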

This works except in the case where your backups include live database files (where you put the database in extended logging mode, back up the data files while they are being modified, then back up the logs).

I haven't found a good way to verify these without doing a full database restore and seeing if the logs apply cleanly, along with having the DB do internal checks.

Isn't this use case solved by snapshotting the volume, then backing up the snapshot? Since the snapshot captures the filesystem state at a point in time, any database that's crash-tolerant should be fine. Snapshotting is natively supported on Windows and Macs, not sure about linux.
> Snapshotting is natively supported on Windows and Macs, not sure about linux.

JFTR, this is supported on Linux as well and, especially when using LVM, is quite simple and straightforward.

You can do it manually [0,1] or using tools made for just this purpose, such as mylvmbackup [2] (which should be available in most distributions' package repositories).

---

[0]: https://www.badllama.com/content/mysql-backups-using-lvm-sna...

[1]: https://www.percona.com/blog/2006/08/21/using-lvm-for-mysql-...

[2]: https://www.lenzg.net/mylvmbackup/