| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by Nyan 1340 days ago

PAR is somewhat unwieldy to use. In addition to needing to explicitly create it (and it not being particularly fast, on a large enough data set), PAR2s can't be 'updated'. The PAR3 spec allows for some limited updating, but it's far from ideal.

It often makes more sense for the file system to deal with ECC in my opinion. PAR probably makes more sense for archived files that aren't expected to change, but may be moved across file systems.

PAR2 handles subfolders by the way, just not empty folders.

1 comments

joshspankit 1340 days ago

No exactly: The current PAR format does not make sense for this use-case (including because of the limitations you mentioned), but IMO the technology does.

Files with on-disk ECC can be moved from cloud to cloud, cloud to desktop, filesystem to filesystem, desktop to stick, then stick to NAS all without losing ECC protection. No single filesystem can do that.

link

Nyan 1340 days ago

Sorry if I'm dense, but what does "this use-case" exactly refer to here?

link

joshspankit 1340 days ago

Fair question. What I’m referring to is file backup and archive for anything up to enterprise level.

So specifically: photography archives, videos (including b-roll for content producers/videographers), project backups, personal files, important documents, etc. Up to and including anything that could be posted to r/datahoarders

link

Nyan 1339 days ago

Ah, PAR makes the most sense for archival material like that. What were you looking for in the PAR format that'd make more sense for this use case?

link

joshspankit 1339 days ago

There are a few shortcomings:

1. “Lots of tiny files”

Some folders unavoidably have tiny files in bulk (Document backups can be like this. One other example that jumps to mind: macOS applications with translation files)

In these cases, PAR/PAR2 have issues with the block size (can only have one file per block which leads to a lot of wasted space)

2. Tracking changes across filenames

This is counterintuitive, but I’ve run in to this enough to mention it: if the item to archive is a folder where the contents might change over time, any single file might get renamed and it’s contents slightly modified. A parity file tool could look at the blocks that have not changed, recognize the rename, and “correct” the reference before doing more processing. If it’s a valid change to the file: saving the work required to recalculate the whole file and if it’s damage to the renamed file: being able to repair it simply.

3. Being able to update in-place

Sometimes the ideal is to create parity files for a folder, even if that folder is actively used (say for example b-roll that changes by 10% maybe once a month). A parity tool could update that 10% without having to recalculate the whole thing (Ideally this would be adding files similar to ‘git add’ so that someone does not accidentally add file damage to the parity set)

link

Nyan 1338 days ago

I see. Yeah, PAR3 tries to address the 'lots of files' scenario better than PAR2.

But updates are problematic. You could 'delete' parts of a recovery file if the data is present. However a file being updated typically means that the original data (before the update) is no longer present, meaning you can't really 'delete' that part of the recovery to replace it with the new data. You could try and retrieve the old data by recovering it, but at this point, it may just be easier to recreate the entire PAR set again.

If it's just adding to the recovery set, PAR3 does have provision for that.

link