Hacker News
by gruez 2246 days ago
What you're describing is significantly harder than what's described in the blog post. Not only do you have to validate a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (ie. the file format is valid but a chunk of it is missing).

Most people solve this issue by keeping multiple versions, not by trying to "validate" the backups somehow.

Maybe I'm misinterpreting the output from their backup tool, but isn't that exactly what it's doing?

    Metrics for todoist-fullsync:
    
    name   1 days ago  8 days ago
    ------------------------------
    files  1           1
    bytes  82363       86661
    items  85          87

The "items" line there seems like it's actually parsing the file and counting the number of entries in it? It's also captured in point #2: "Can be intuitively evaluated as plausible or suspicious. If the number of tasks in my Todoist export dropped from dozens to 1, that would be cause for concern."

Right. Aside from 'files' and 'bytes', the metrics are just the result of running shell commands specified in a config file. In this case it's `jq '.items | length' $PARCEL_PATH`, i.e., parse the file and print the length of the attribute named "items".
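
For anyone without jq handy, roughly the same metric can be sketched in Python. This is only an illustration, not part of the tool described above: `count_items` is a hypothetical helper name, and it assumes the export is a JSON object whose "items" attribute is a list.

```python
import json


def count_items(path, key="items"):
    """Parse a JSON export and return the length of the value stored under `key`.

    Hypothetical helper, roughly mirroring `jq '.items | length' FILE`.
    """
    with open(path, encoding="utf-8") as f:
        # json.load raises json.JSONDecodeError on a truncated or corrupt file
        data = json.load(f)
    # Raises KeyError if the attribute is missing from the top-level object
    return len(data[key])
```

Either failure mode (a parse error or a missing attribute) is itself a useful signal that the backup needs a closer look.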

Obviously, that won't catch all potential problems in the file, but it's a low-effort way to catch some.

I keep multiple versions as well, and also use third-party backup software on all these files. These techniques are meant to be part of something analogous to 'defense in depth' against errors in the backup process; they aren't meant to be thorough or foolproof.

> Not only do you have to validate a file looks like a .jpg/.json/.zip file, you also need to validate that it looks semantically correct (ie. the file format is valid but a chunk of it is missing).

But you don't have to do that perfectly to get value out of it; for example:

- If the .json file parses as JSON, then at least you probably didn't truncate the download mid-stream.

- If it also contains a particular attribute, then you probably didn't save a structured error response instead of the actual data, or save something from an endpoint that has changed radically and incompatibly, whose output might no longer be adequate.

- If it also has roughly the number of elements you expect, you probably didn't miss entire pages of the response.
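
The three checks above can be stacked into one small function. This is a minimal sketch under stated assumptions, not anyone's actual tooling: `check_export`, the "items" attribute, and the 50% tolerance are all made up for illustration, and the export is assumed to be a JSON object holding a list.

```python
import json


def check_export(text, key="items", expected=None, tolerance=0.5):
    """Return a list of problems found; an empty list means no check failed.

    Hypothetical helper: `key`, `expected`, and `tolerance` are illustrative.
    """
    # Check 1: does it parse at all? A download cut off mid-stream usually won't.
    try:
        data = json.loads(text)
    except json.JSONDecodeError as exc:
        return [f"does not parse as JSON: {exc.msg}"]

    # Check 2: is the expected attribute there? A structured error response
    # or a changed endpoint often lacks it.
    if not isinstance(data, dict) or not isinstance(data.get(key), list):
        return [f"missing expected attribute {key!r}"]

    # Check 3: is the count roughly what we expect? Missing pages show up here.
    problems = []
    count = len(data[key])
    if expected is not None and abs(count - expected) > tolerance * expected:
        problems.append(f"{key!r} has {count} entries, expected about {expected}")
    return problems
```

For example, `check_export(text, expected=87)` would pass an export with 85 items but flag one with a single item. None of this proves the file is complete; it only rules out the cheap-to-detect failures.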