by tonyarkles 2248 days ago
I guess it depends somewhat on what you're backing up and what the anticipated failure modes might be. As an example, if there was a bug in my todo software that deleted a bunch of entries, the hash scheme wouldn't pick that up. You've just successfully backed up corrupted data, and you're not aware of it. SQL dumps would be another good example of this. If one day you do a backup and the backup reports that it has archived significantly fewer rows than yesterday, you know something's up. Maybe a fault lost some data, maybe the archiver is broken, etc.
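The "significantly fewer rows than yesterday" check can be sketched in a few lines. This is only an illustration: the function name `looks_suspicious` and the 50% threshold `max_drop` are assumptions, not something any particular backup tool provides.

```python
def looks_suspicious(previous: int, current: int, max_drop: float = 0.5) -> bool:
    """Return True when `current` fell by more than `max_drop` relative to `previous`."""
    if previous <= 0:
        # Nothing to compare against, so a relative drop cannot be measured.
        return False
    drop = (previous - current) / previous
    return drop > max_drop


print(looks_suspicious(87, 85))  # → False (a small day-to-day change)
print(looks_suspicious(87, 1))   # → True (most of the rows vanished)
```

The threshold is a judgment call; the point is only that a count recorded on every run gives you something to compare against.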

What you're describing is significantly harder than what's described in the blog post. Not only do you have to validate that a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (i.e., the file format is valid but a chunk of it is missing).

Most people solve this issue by keeping multiple versions, not by trying to "validate" the backups somehow.

Maybe I'm misinterpreting the output from their backup tool, but isn't that exactly what it's doing?

    Metrics for todoist-fullsync:
    
    name   1 days ago  8 days ago
    ------------------------------
    files  1           1
    bytes  82363       86661
    items  85          87

The "items" line there seems like it's actually parsing the file and counting the number of entries in it? It's also captured in point #2: "Can be intuitively evaluated as plausible or suspicious. If the number of tasks in my Todoist export dropped from dozens to 1, that would be cause for concern."

Right. Aside from 'files' and 'bytes', the metrics are just the result of running shell commands specified in a config file. In this case it's `jq '.items | length' $PARCEL_PATH`, i.e., parse the file and print the length of the attribute named "items".

Obviously, that won't catch all potential problems in the file, but it's a low-effort way to catch some.
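For anyone without jq at hand, a rough Python counterpart of that metric might look like the sketch below. The function name `count_items` is hypothetical, and unlike the jq expression (which prints 0 when the attribute is missing) this version fails loudly.

```python
import json


def count_items(path: str, key: str = "items") -> int:
    """Parse a JSON export and return the length of the value stored under `key`."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)  # raises json.JSONDecodeError on malformed input
    # Raises KeyError if the top-level object lacks `key`.
    return len(data[key])
```

Called on the backed-up file, it returns a single number that can be recorded on each run and compared with earlier runs.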

I keep multiple versions as well, and also use third-party backup software on all these files. These techniques are meant to be part of something analogous to a 'defense in depth' against errors in the backup process, not thorough or foolproof.

> Not only do you have to validate that a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (i.e., the file format is valid but a chunk of it is missing).

But you don't have to do that perfectly to get value out of it; for example:

- If the .json file parses as JSON, then at least you probably didn't truncate the download mid-stream.

- If it also contains a particular attribute, then you probably didn't save a structured error response instead of the actual data, or save something from an endpoint that has changed in a radically backward-incompatible way and might no longer be adequate.

- If it also has roughly the number of elements you expect, you probably didn't miss entire pages of the response.
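
The three checks above can be layered in one small function. This is a minimal sketch under stated assumptions: the name `check_export`, the default attribute `"items"`, and the `min_count` threshold are all hypothetical choices for illustration.

```python
import json


def check_export(raw: str, key: str = "items", min_count: int = 1) -> list:
    """Return a list of problems found; an empty list means all three checks passed."""
    # Check 1: does it parse at all? A download cut off mid-stream usually does not.
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["not valid JSON (possibly truncated mid-stream)"]
    # Check 2: is the expected attribute present? An error response usually lacks it.
    if not isinstance(data, dict) or key not in data:
        return [f"missing attribute {key!r} (possibly an error response)"]
    # Check 3: are there roughly as many elements as expected?
    items = data[key]
    if not isinstance(items, list) or len(items) < min_count:
        return [f"fewer than {min_count} elements under {key!r} (possibly missing pages)"]
    return []


print(check_export('{"items": [1, 2'))        # truncated input fails check 1
print(check_export('{"error": "nope"}'))      # error response fails check 2
print(check_export('{"items": [1, 2, 3]}'))   # → []
```

Each check is cheap and imperfect on its own; stacking them just narrows the ways a bad file can slip through unnoticed.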