by cpburns2009 2989 days ago
It may not be a good choice for long-term data storage, but I disagree that it should not be used for data sharing or software distribution. Different use cases have different needs. If you need long-term storage, it's better to avoid lossless compression that can break after minor corruption. You should also be storing parity/ECC data (I don't recall the subtle difference). If you only need short to moderate term storage, the best compression ratio is likely optimal. Keep a spare backup just in case.

> It may not be a good choice for long-term data storage, but I disagree that it should not be used for data sharing or software distribution. Different use cases have different needs.

I'm not so sure. Using tools suitable for long-term archiving by default might not be a bad practice. The thing about archiving is that it's often hard to know in advance what exactly you want to keep long-term. Using more robust formats probably won't cost much in the short term, but could pay off in the long term.

For long-term archival I think relying on your compression software to protect data integrity is a fool's errand. Protecting against bit-rot should be a function of your storage layer as long as you have control over it (in contrast to, say, Usenet, where multiple providers have copies of data and you can't trust them to not lose part of it - hence the inclusion of .par files for everything under alt.binaries).
I keep seeing recommendations for par/par2, but as software, the project doesn't seem to be actively maintained? As an aside, that makes me think of dead languages and the use of Latin for scientific names because it isn't changing anymore... but do you want that out of archival formats and software?
There's a par2 "fork" under active development - https://github.com/Parchive/par2cmdline

The fork compiled for me this week, when the official 0.3 version on Sourceforge wouldn't. I vaguely remembered par3 being discussed, but couldn't find anything usable. And that's an example of why to be wary of new formats, I guess?

Yes. Learn from what we have already experienced in the world of computing.

* https://www.theguardian.com/uk/2002/mar/03/research.elearnin...

It's probable that PAR2 is essentially feature complete, so no maintenance is really needed.

The program does pretty much the same thing as it did a decade ago.

Nope. You always need end-to-end parity integrity checking. Your data goes through too many layers before reaching the storage medium. E.g. I once got a substantial amount of my pictures filled with bit errors because of a faulty RAM module in my NAS.
This happened to me, and caused me to rethink my approach to file management.

Unfortunately, mainstream tooling is largely fire-and-forget and never includes verification (e.g. copying succeeds even if the written data is getting garbled), so one is forced to use multi-step workflows to get around this. It's pretty discouraging that no strong abstractions exist in this space.
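To make the "multi-step workflow" concrete, here is a minimal sketch of a copy that re-reads the destination and compares digests. The function names `sha256_of` and `verified_copy` are made up for illustration, and the read-back may be served from the OS page cache rather than the physical medium, so treat it as a sanity check, not proof of what landed on disk.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files don't need to fit in RAM."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verified_copy(src, dst):
    """Copy src to dst, then re-read dst and compare SHA-256 digests."""
    expected = sha256_of(src)
    shutil.copyfile(src, dst)
    # Caveat: this read may hit the page cache instead of the device.
    actual = sha256_of(dst)
    if actual != expected:
        raise IOError(f"verification failed for {dst}: {actual} != {expected}")
    return expected
```

Keeping the returned digest next to the file also gives you something to compare against later, when checking for bit rot.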

Yes, end-to-end checking is a must - but that applies to any method of integrity protection. I could run TrueNAS at home on some old desktop I've retired instead of the used Dell R520 I bought for the task, but I have experienced memory failures before and expect them to happen - this doesn't change if you're using .par files instead.

(People underestimate how frequently memory corruption can actually occur. Almost two years ago, when Overwatch first came out, the game kept crashing - it took me forever to find that the cause was a faulty DIMM. Hell, right now the R320 I have in my rack at home has an error indicator because one of my 2 year old Crucial RDIMMs has an excessive amount of correctable errors.)

I've used XZ to compress tarballs of backup. XZ was useful so I could store more backups on an external hard drive. I have seen bit rot on some of these files (stored on a magnetic HDD), in the sense that the md5sum of the .tar.xz archive no longer matches when it was created. What do you suggest for creating parity/ECC in this case? I'm aware of parchive, but is that the right choice and in what configuration?
Keep in mind I'm not an archival expert so you should do your own research. That being said, currently I'm using pyFileFixity [1] to generate the hashes and ECC data for my personal backups. I write them to M-Disc Blu-rays using Dvdisaster [2] which can also write additional ECC data. After a lot of googling and reading this useful Super User question [3], and this extensive answer [4] I settled on this setup. I must admit that I am guilty of storing images as JPGs and compressing most of my files in ZIPs for convenience.

[1]: https://github.com/lrq3000/pyFileFixity

[2]: http://dvdisaster.net/en/index.html

[3]: https://superuser.com/q/374609/52739

[4]: https://superuser.com/a/873260/52739

The whole structural adaptive encoding seems like massive overcomplication. I feel like clever tricks such as that serve only to bite in the ass when you need it the most.

Same goes for the bit about JPEG. Sure, it might not be ideal technically, but recommending JPEG2000 (presumably as there is no JPEG2) with its ridiculously poor software support seems weak too. What use is a robust file that you can't open?

When you're transferring files and need to cope with corrupted/missing chunks, you should use a parity scheme. Others have mentioned that; it's common for, for example, Usenet.

If you can't control the underlying storage, then ditto. Keeping and maintaining explicit parity chunks is somewhat inconvenient, but it works.

But if you just want to avoid bitrot of your own files, sitting on your own HDD, I'd recommend using a reliable storage system instead. ZFS or, at higher and more complicated levels, Ceph/Rook and its kin. That still offers a posix interface (unlike parity files), while being just as safe.

If I am using a single HDD, can ZFS still add parity data? That's neat if it can. I assumed parity with ZFS was for something like RAID6 where there are multiple HDDs in a set.

Do any file systems other than ZFS support adding parity in a single HDD config? Last I checked, getting ZFS in Linux required lots of side band steps due to licensing issues.

ZFS can do multiple copies of a file on a single hard drive. It is not adding parity.

ZFSOnLinux is developed outside the Linux tree for two reasons: one, it is easier that way, and two, Linus does not want it in the main tree. Consequently, you need to install it in addition to the kernel as if it were entirely userspace software. That does not add any more difficulty than, say, installing Google Chrome. :/

I have occasionally had downloaded tarballs that were truncated by network failure. It's nice to be able to get a meaningful error when decompression fails, instead of silently decompressing only part of the data. So built-in integrity checks are also desirable for short-term distribution.
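A small sketch of that behaviour using Python's standard `lzma` module, which writes the .xz container with a CRC64 check by default. The payload and the helper name `try_decompress` are made up for the example.

```python
import lzma

payload = b"backup data " * 10_000
archive = lzma.compress(payload)  # .xz container, CRC64 check by default


def try_decompress(blob):
    """Return the decompressed bytes, or None if the stream is damaged."""
    try:
        return lzma.decompress(blob)
    except lzma.LZMAError:
        return None


intact = try_decompress(archive)
# Simulate a download that was cut off halfway through.
truncated = try_decompress(archive[: len(archive) // 2])
```

Here `intact` equals the original payload, while the truncated stream raises `LZMAError` and `try_decompress` returns `None` instead of handing back a partial result.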
> parity/ECC

Parity is ECC with 1 bit. (ECC is usually Reed-Solomon, which is just a fancy name for a set of more equations than you have data chunks, which is how it adds redundancy.) Usually you should aim for +20-40% redundancy.
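For the simplest case, here is a sketch of single XOR parity over equal-sized chunks. This is not Reed-Solomon, which generalises the idea to several redundant chunks; the chunk contents and function names are illustrative only.

```python
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def make_parity(chunks):
    """XOR equal-sized chunks together into one parity chunk."""
    return reduce(xor_bytes, chunks)


def recover_missing(chunks, parity):
    """Rebuild the single chunk marked None from the others plus parity."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing chunk")
    present = [c for c in chunks if c is not None]
    return reduce(xor_bytes, present, parity)


data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]
parity = make_parity(data)  # 4 data chunks + 1 parity chunk = +25% overhead
damaged = [data[0], None, data[2], data[3]]
restored = recover_missing(damaged, parity)  # b"BBBB"
```

One parity chunk can only replace one chunk that is known to be missing. Surviving more losses, or locating errors at unknown positions, is what the extra equations in Reed-Solomon are for.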

Ceph, HDFS and other distributed storage systems implement erasure coding (which is subtly different from error correction coding), which I would recommend for handling backups.

The interesting thing about erasure codes is that you need to checksum your shards independently from the EC itself. If you supply corrupted or wrong shards, you get corrupted data back.
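A toy illustration of that point, using XOR parity as a stand-in for a real erasure code. The shard contents are made up, and SHA-256 is just one reasonable choice of per-shard checksum.

```python
import hashlib
from functools import reduce


def xor_bytes(a, b):
    """XOR two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def sha(blob):
    return hashlib.sha256(blob).hexdigest()


shards = [b"alpha---", b"bravo---", b"charlie-"]
parity = reduce(xor_bytes, shards)
checksums = [sha(s) for s in shards]  # stored separately from the parity

# Shard 0 is lost, and shard 1 comes back silently corrupted.
stored = [None, b"bravo-X-", shards[2]]

# A naive rebuild trusts whatever it is given and returns wrong data.
naive = reduce(xor_bytes, [s for s in stored if s is not None], parity)

# Checking each shard first exposes the bad one before it is used.
bad = [i for i, s in enumerate(stored)
       if s is not None and sha(s) != checksums[i]]
```

The rebuild itself raises no error: `naive` simply differs from the original shard 0. Only the independent checksums reveal that shard 1 should have been treated as missing too.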

I think for backup (as in small-scale, "fits on one disk") error-correcting codes are not really a good approach, because IME hard disks with one error you notice usually have made many more errors - or will do so shortly. In that case no ECC will help you. If, on the other hand, you're looking at an isolated error, then only very little data is affected (on average).

For example, a bit error in a chunk in a tool like borg/restic will only break that chunk; a piece of a file or perhaps part of a directory listing.

So for these kinds of scenarios "just use multiple backup drives with fully independent backups" is better and simpler.

Vivint published a Go package a while back that does both Reed Solomon and Forward Error Correction: https://innovation.vivint.com/introduction-to-reed-solomon-b...
For small scale, use Dropbox or Google Drive, or whatever, because for small scale the most important part of backup is actually reliably having it done. If you rely on manual process, you're doomed. :)

For large scale in-house things: Ceph regularly scrubs the data (compares checksums), and DreamHost has DreamObjects.

Thanks for mentioning borg/restic, I have never heard of them. (rsnapshot [rsync] works well, but it's not so shiny) Deduplication sounds nice. (rsnapshot uses hardlinks.)

That made me look for something btrfs-based, and I found https://github.com/digint/btrbk, which sends btrfs snapshots to a remote somewhere (and can also encrypt them). Could be useful for small setups.

I think rsync/rsnapshot aren't really appropriate for backups:

(1) They need full support for all FS oddities (xattrs, rforks, acls etc.) wherever you move the data

(2) They don't checksum the data at all.

The newer tools don't have either problem that much: For (1) they pack/unpack these in their own format which doesn't need anything special, so if you move your data twice in a circle you won't lose any (but their support for strange things might not be as polished as e.g. rsync's or GNU coreutils). And for deduplication they have to do (2) with cryptographic hashes.
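Roughly how hash-based deduplication gives you checksums for free, as a toy sketch. The `ChunkStore` class and fixed 4-byte chunks are invented for illustration and are not how borg or restic are actually implemented; those tools use content-defined chunk boundaries.

```python
import hashlib


def split_fixed(data, size=4):
    """Fixed-size chunking, to keep the example short."""
    return [data[i:i + size] for i in range(0, len(data), size)]


class ChunkStore:
    def __init__(self):
        self.chunks = {}  # SHA-256 hex digest -> chunk bytes

    def put(self, data):
        """Store data, returning the list of chunk ids that describes it."""
        ids = []
        for chunk in split_fixed(data):
            cid = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(cid, chunk)  # identical chunks stored once
            ids.append(cid)
        return ids

    def damaged(self):
        """Return ids whose stored bytes no longer hash to their id."""
        return [cid for cid, chunk in self.chunks.items()
                if hashlib.sha256(chunk).hexdigest() != cid]


store = ChunkStore()
first = store.put(b"aaaabbbbcccc")
second = store.put(b"aaaabbbbdddd")  # shares two chunks with the first
```

Six chunks were written but only four are stored. Because each chunk's id is its own hash, a later bit flip shows up in `damaged()` and is confined to that one chunk.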

However (as an ex-dev of one of these) they all have one or the other problem/limitations that won't go away. (Borg has its cache and weak encryption, restic iirc has difficult-to-avoid performance problems with large trees etc.)

Something that nowadays might also need to be discussed is if and how vulnerable your on-line backup is against BREACH-like attacks. E.g. .tar.gz is pretty bad there.

Hm, rsync does MD5 checking automatically. Which doesn't do much against bitrot [0], but it should help with the full circle thing. (And maybe it'll be SHA256+ in newer versions? Though there's not even a ticket in their bugzilla about this. And maybe MD5 is truly enough against random in-transit corruption.)

Yeah, crypto is something that doesn't play well with dedupe, especially if you don't trust the target backup server.

Uh, BREACH was a beast (he-he). I'm still a bit uneasy after thinking about how long these bugs were lurking in OpenSSL. Thankfully the splendid work of Intel engineers quickly diverted the nexus of our bad feels away from such high level matters :|

[0] That's something that the btrfs/ZFS/Ceph should/could fix. (And btrfs supports incremental mode for send+receive.)

Of course there are a million variables here, but for compressible data arguably compression+ecc is more robust against damage than uncompressed data. The rationale being that with compression you can afford to use more/bigger ecc.
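A back-of-the-envelope version of that rationale, assuming a made-up, highly repetitive log-like payload and plain zlib. Real data will compress far less dramatically, so the numbers are only illustrative.

```python
import zlib

original = b"timestamp=2018-01-01 level=INFO msg=ok\n" * 5_000
compressed = zlib.compress(original, 9)

# Space freed by compression that could be spent on ECC data instead,
# while still using no more room than the uncompressed file.
budget = len(original) - len(compressed)
redundancy_ratio = budget / len(compressed)

print(f"original:   {len(original)} bytes")
print(f"compressed: {len(compressed)} bytes")
print(f"ECC budget: {budget} bytes ({redundancy_ratio:.0%} of compressed size)")
```

The trade-off is that a compressed stream without any ECC is more fragile, since one bad byte can damage everything after it. The argument only holds if the freed space really is spent on redundancy.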