by arghwhat 2984 days ago
> To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.

3 individual headers for one file format is unnecessary complexity.

> To say padding is "useless"

Padding in general is not useless, but padding in a compression format is very counterproductive.

> And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.

This isn't about "someone making a bad implementation!", it's about crucial features being optional. That is, completely compliant implementations may or may not be able to decompress a given XZ archive, and may or may not be able to validate the archive.
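To make the "optional integrity check" point concrete, here is a rough sketch using Python's standard `lzma` module (my choice of tool, not something from the XZ spec itself). It writes the same data as `.xz` streams with each check type and reads back which check the stream declares. The helper names `check_used` and `report` are made up for illustration.

```python
import lzma

# Check types the .xz container can declare. Per the Python docs, only
# CHECK_NONE and CHECK_CRC32 are guaranteed to be supported by every build.
CHECKS = {
    "none": lzma.CHECK_NONE,
    "crc32": lzma.CHECK_CRC32,
    "crc64": lzma.CHECK_CRC64,
    "sha256": lzma.CHECK_SHA256,
}


def check_used(xz_blob: bytes) -> int:
    """Return the integrity-check ID declared by an .xz stream."""
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    decoder.decompress(xz_blob)
    return decoder.check


def report(data: bytes) -> dict:
    """Map each check name to True/False (round-trips with that check)
    or None (this liblzma build does not support the check)."""
    result = {}
    for name, check in CHECKS.items():
        if not lzma.is_check_supported(check):
            result[name] = None
            continue
        blob = lzma.compress(data, format=lzma.FORMAT_XZ, check=check)
        result[name] = check_used(blob) == check
    return result
```

A stream written with `CHECK_NONE` is still a valid `.xz` file, it just carries nothing to validate against, and `is_check_supported` exists precisely because a given build may lack some of the check types.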

XZ may not have been chosen blindly, but it certainly does not seem like a sensible format. There is no benefit to this complexity. We do not need or benefit from a format that is flexible, as we can just swap format and tool if we want to swap algorithms, like we have done so many times before (a proper compression format is just a tiny algorithm-specific header + trailing checksum, so it is not worth generalizing away).
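As a sketch of what "tiny algorithm-specific header + trailing checksum" could look like, here is a toy container around a raw LZMA2 stream. Everything about the layout is an assumption for illustration: the `TOY1` magic, the CRC32 + length trailer, and the fixed preset are hypothetical and are not the lzip or xz layout.

```python
import lzma
import struct
import zlib

MAGIC = b"TOY1"  # hypothetical 4-byte magic, not a real format
# Fixed filter chain: the "algorithm-specific" part is baked into the format.
FILTERS = [{"id": lzma.FILTER_LZMA2, "preset": 6}]
TRAILER = struct.Struct("<IQ")  # CRC32 of the data, then uncompressed size


def pack(data: bytes) -> bytes:
    """Header + raw LZMA2 payload + mandatory checksum trailer."""
    payload = lzma.compress(data, format=lzma.FORMAT_RAW, filters=FILTERS)
    return MAGIC + payload + TRAILER.pack(zlib.crc32(data), len(data))


def unpack(blob: bytes) -> bytes:
    """Decode a TOY1 blob; the integrity check is never optional."""
    if len(blob) < len(MAGIC) + TRAILER.size or not blob.startswith(MAGIC):
        raise ValueError("not a TOY1 stream")
    crc, size = TRAILER.unpack(blob[-TRAILER.size:])
    payload = blob[len(MAGIC):-TRAILER.size]
    data = lzma.decompress(payload, format=lzma.FORMAT_RAW, filters=FILTERS)
    if len(data) != size or zlib.crc32(data) != crc:
        raise ValueError("integrity check failed")
    return data
```

With no optional fields, every decoder for this toy format either handles a file fully or rejects it, which is the property being argued for above.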

Any and all benefits of XZ lie in LZMA2. We could have lzip2 and avoid all of these problems.

(I have no opinion as to whether LZIP should supersede GZIP/BZIP2, but XZ certainly seems like a poor choice.)


> 3 individual headers for one file format is unnecessary complexity.

So all these file formats are unnecessarily complex?

- all OpenDocument formats

- all MS office formats

- all multimedia container formats

- deb/rpm packages

etc?

It depends on how you count headers, but yes.

Multimedia containers, while too complicated, don't really qualify for a position on that list. These containers are basically just special purpose file containers, and thus the headers of the "files" within should not contribute to the header count.

deb/rpm are also good examples of old and quite obnoxious formats. Deb is an AR archive of two GZIP-compressed TAR archives (control and data) and a single file (debian-binary). TAR replaced AR for all but a few ancient tasks long ago, but for some reason, Deb uses both. A tar.gz with 3 files/folders that were not tar'd or compressed would have been much simpler. I believe RPM goes that route, but rather than TAR they use CPIO, and rather than embedding the metadata inside the archive, the RPM package has its own header.
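For reference, the outer AR layer of a .deb is simple enough to walk by hand. A minimal sketch, assuming the common ar layout (8-byte `!<arch>\n` magic, 60-byte ASCII member headers, data padded to an even offset) and short member names only; `list_ar_members` is a made-up helper name, not part of any library.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_LEN = 60  # name(16) mtime(12) uid(6) gid(6) mode(8) size(10) end(2)


def list_ar_members(blob: bytes) -> list:
    """Return (name, size) for each member of an ar archive held in memory."""
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    pos = len(AR_MAGIC)
    while pos + HEADER_LEN <= len(blob):
        header = blob[pos:pos + HEADER_LEN]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header")
        # Names are space-padded; GNU ar also appends a trailing '/'.
        name = header[0:16].decode("ascii").rstrip().rstrip("/")
        size = int(header[48:58].decode("ascii"))
        members.append((name, size))
        # Member data is padded to an even length.
        pos += HEADER_LEN + size + (size % 2)
    return members
```

Run over the bytes of a .deb, this should list `debian-binary` followed by a control and a data tarball, each of which then needs its own decompressor and a TAR reader on top.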

Both RPM and DEB have gained support for a bunch of compression formats, meaning that not only does the content of a DEB/RPM package have dependencies, but each package can now basically end up having its own dependencies that need to be satisfied before you can even read the package in the first place. Oh, and one of the supported compression formats is now XZ, adding an extra dependency, as your version of XZ might not support the contained XZ archive at all.

Aren't MS office formats the poster child for overly complex file formats?

> rpm packages

I recall an article posted here detailing how incredibly bloated and crufty the RPM format was.