Hacker News
by joshspankit 1335 days ago
I’m constantly surprised that it’s not commonplace to use on-disk parity files.

It’s so uncommon that the PAR3 format was never really finished and no one has created a replacement that handles subfolders.

Why I’m surprised: Not only does it solve the problem of bit-rot, but the parity files can be moved to USB sticks, NAS drives, mobile devices, etc., and the original files can be verified/repaired by any device that understands the parity file format. PAR2 is still great for photos/audio/video, as well as any flat-folder assets.
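To make the verify/repair idea concrete, here is a minimal sketch of how a parity block can rebuild a lost data block. It uses plain XOR parity, which is a deliberate simplification: PAR2 itself uses Reed-Solomon coding and can recover more than one damaged block. The function names are made up for illustration and are not part of any PAR tool.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(blocks):
    """Build one parity block (simplified stand-in for a real parity file)."""
    return xor_blocks(blocks)


def repair(blocks, parity, missing_index):
    """Rebuild the block at missing_index from the surviving blocks plus parity."""
    survivors = [b for i, b in enumerate(blocks) if i != missing_index]
    return xor_blocks(survivors + [parity])


# Three equal-length blocks standing in for file data.
blocks = [b"photo001", b"audio002", b"video003"]
parity = make_parity(blocks)

# Pretend block 1 was lost to bit-rot and rebuild it.
restored = repair(blocks, parity, missing_index=1)
```

The parity block is just bytes, so it can live on a USB stick or NAS and still repair the originals wherever they end up.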

PAR is somewhat unwieldy to use. In addition to needing to explicitly create it (and it not being particularly fast, on a large enough data set), PAR2s can't be 'updated'. The PAR3 spec allows for some limited updating, but it's far from ideal.

It often makes more sense for the file system to deal with ECC in my opinion. PAR probably makes more sense for archived files that aren't expected to change, but may be moved across file systems.

PAR2 handles subfolders by the way, just not empty folders.

No, exactly: The current PAR format does not make sense for this use-case (including because of the limitations you mentioned), but IMO the technology does.

Files with on-disk ECC can be moved from cloud to cloud, cloud to desktop, filesystem to filesystem, desktop to stick, then stick to NAS all without losing ECC protection. No single filesystem can do that.

Sorry if I'm dense, but what does "this use-case" exactly refer to here?

Fair question. What I’m referring to is file backup and archive for anything up to enterprise level.

So specifically: photography archives, videos (including b-roll for content producers/videographers), project backups, personal files, important documents, etc. Up to and including anything that could be posted to r/DataHoarder.

Ah, PAR makes the most sense for archival material like that. What were you looking for in the PAR format that'd make more sense for this use case?

There are a few shortcomings:

1. “Lots of tiny files”

Some folders unavoidably have tiny files in bulk (Document backups can be like this. One other example that jumps to mind: macOS applications with translation files)

In these cases, PAR/PAR2 have issues with the block size: a block can only hold data from one file, so every tiny file takes up a whole block, which leads to a lot of wasted space.

2. Tracking changes across filenames

This is counterintuitive, but I’ve run into this enough to mention it: if the item to archive is a folder where the contents might change over time, any single file might get renamed and its contents slightly modified. A parity file tool could look at the blocks that have not changed, recognize the rename, and “correct” the reference before doing more processing. If it’s a valid change to the file, that saves the work of recalculating the whole file; if it’s damage to the renamed file, it can be repaired simply.

3. Being able to update in-place

Sometimes the ideal is to create parity files for a folder, even if that folder is actively used (say for example b-roll that changes by 10% maybe once a month). A parity tool could update that 10% without having to recalculate the whole thing (Ideally this would be adding files similar to ‘git add’ so that someone does not accidentally add file damage to the parity set)
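A rough sketch of what those three points could look like as bookkeeping, assuming a hypothetical tool that keeps a manifest of per-block hashes for each file. None of this is how PAR2 or PAR3 actually work: the 4096-byte block size, the function names, and the 50% overlap threshold are all made up for illustration, and no parity is computed here. It only shows how padding adds up, how a rename could be spotted from unchanged blocks, and how a 'git add'-style step could report which blocks changed.

```python
import hashlib
import math

BLOCK_SIZE = 4096  # hypothetical block size, chosen only for illustration


def padded_size(file_sizes, block_size=BLOCK_SIZE):
    """Bytes covered when every file is padded up to whole blocks (point 1)."""
    return sum(math.ceil(size / block_size) * block_size for size in file_sizes)


def block_hashes(data, block_size=BLOCK_SIZE):
    """Hash each fixed-size block of a file's contents."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]


def find_rename(manifest, new_hashes, min_overlap=0.5):
    """Return the tracked path sharing the most blocks with new_hashes (point 2)."""
    best_path, best_score = None, 0.0
    for path, old_hashes in manifest.items():
        if not old_hashes:
            continue
        shared = len(set(old_hashes) & set(new_hashes))
        score = shared / max(len(old_hashes), len(new_hashes))
        if score > best_score:
            best_path, best_score = path, score
    return best_path if best_score >= min_overlap else None


def stage(manifest, path, data):
    """Explicitly accept a file's current contents, like 'git add' (point 3).

    Returns (previous_path_if_renamed, number_of_changed_blocks).
    """
    new_hashes = block_hashes(data)
    old_path = path if path in manifest else find_rename(manifest, new_hashes)
    old_hashes = manifest.pop(old_path, []) if old_path else []
    changed = sum(
        1 for i, h in enumerate(new_hashes)
        if i >= len(old_hashes) or old_hashes[i] != h
    )
    manifest[path] = new_hashes
    return (old_path if old_path != path else None), changed


# Point 1: a thousand 100-byte files cover far more than their raw size.
raw = 1000 * 100
covered = padded_size([100] * 1000)

# Points 2 and 3: a file is renamed and its last block edited.
manifest = {}
original = b"a" * 4096 + b"b" * 4096 + b"c" * 4096
edited = b"a" * 4096 + b"b" * 4096 + b"d" * 4096
first = stage(manifest, "clip.mov", original)
second = stage(manifest, "clip-v2.mov", edited)
```

Because nothing enters the manifest unless it is staged, a file that changed without being staged would show up as a mismatch to verify or repair rather than being silently folded into the parity set.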