by moltensyntax 2984 days ago
This article again? In my opinion, this article is biased. The subtext here is that the author is claiming that his "lzip" format is superior. But xz was not chosen "blindly" as the article claims.

To me, most of the claims are arguable.

To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.

To say padding is "useless"... I don't understand why padding and byte-alignment are given so much vitriol. Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for dozens of years, with no disastrous effects at all.

The xz decision was not made "blindly". There was thought behind the decision.

And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.

I'll leave it at this for now, but there's more I could write.

3 comments

> To say 3 levels of headers is "unsafe complexity"... I don't agree. Indirection is fundamental to design.

3 individual headers for one file format is unnecessary complexity.

> To say padding is "useless"

Padding in general is not useless, but padding in a compression format is very counterproductive.

> And it's pure FUD to say "Xz implementations may choose what subset of the format they support. They may even choose to not support integrity checking at all. Safe interoperability among xz implementations is not guaranteed". You could say this about any software - "oh no, someone might make a bad implementation!" Format fragmentation is essentially a social problem more than a technical problem.

This isn't about "someone making a bad implementation!", it's about crucial features being optional. That is, completely compliant implementations may or may not be able to decompress a given XZ archive, and may or may not be able to validate the archive.
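
A small sketch of the "optional integrity check" point, using Python's standard `lzma` module (a binding to liblzma, not a separate implementation). It only shows that a valid .xz stream can declare no check at all, and that which checks can be verified depends on the build; the sample data and the helper name `stream_check` are made up for illustration.

```python
import lzma

data = b"hello, xz " * 100

# A well-formed .xz stream may be written with no integrity check at all.
unchecked = lzma.compress(data, format=lzma.FORMAT_XZ, check=lzma.CHECK_NONE)


def stream_check(blob: bytes) -> int:
    """Return the check ID declared by an .xz stream (hypothetical helper)."""
    decoder = lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
    decoder.decompress(blob)
    return decoder.check


# The stream decodes fine, but nothing verified the payload.
print(stream_check(unchecked) == lzma.CHECK_NONE)

# Which check types this particular build is able to verify.
for name in ("CHECK_NONE", "CHECK_CRC32", "CHECK_CRC64", "CHECK_SHA256"):
    print(name, lzma.is_check_supported(getattr(lzma, name)))
```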

XZ may not have been chosen blindly, but it certainly does not seem like a sensible format. There is no benefit to this complexity. We do not need or benefit from a format that is flexible, as we can just swap format and tool if we want to swap algorithms, like we have done so many times before (a proper compression format is just a tiny algorithm-specific header + trailing checksum, so it is not worth generalizing away).
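
To make "tiny algorithm-specific header + trailing checksum" concrete, here is a toy container in that shape, loosely like gzip's header-plus-CRC32-and-size trailer. The magic `TOY1`, the field layout, and the function names are all invented for this sketch; raw DEFLATE stands in for the algorithm only because it ships with Python.

```python
import struct
import zlib

MAGIC = b"TOY1"      # hypothetical 4-byte magic, not a real format
TRAILER = "<IQ"      # CRC32 of the original data, then its length (12 bytes)


def pack(data: bytes) -> bytes:
    """Header + raw DEFLATE body + trailing checksum and size."""
    comp = zlib.compressobj(level=9, wbits=-15)  # wbits=-15: no zlib wrapper
    body = comp.compress(data) + comp.flush()
    return MAGIC + body + struct.pack(TRAILER, zlib.crc32(data), len(data))


def unpack(blob: bytes) -> bytes:
    """Decode and always verify; there is no 'unchecked' mode to opt into."""
    if blob[:4] != MAGIC or len(blob) < 16:
        raise ValueError("not a TOY1 stream")
    crc, size = struct.unpack(TRAILER, blob[-12:])
    data = zlib.decompress(blob[4:-12], wbits=-15)
    if len(data) != size or zlib.crc32(data) != crc:
        raise ValueError("integrity check failed")
    return data


blob = pack(b"some payload " * 50)
print(unpack(blob) == b"some payload " * 50)
```

Swapping algorithms in this style means a new magic and a new tool rather than a filter chain negotiated inside one format, which is the trade-off being argued for above.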

Any and all benefits of XZ lie in LZMA2. We could have lzip2 and avoid all of these problems.

(I have no opinion as to whether LZIP should supersede GZIP/BZIP2, but XZ certainly seems like a poor choice.)

> 3 individual headers for one file format is unnecessary complexity.

So all these file formats are unnecessarily complex?

- all OpenDocument formats

- all MS office formats

- all multimedia container formats

- deb/rpm packages

etc?

It depends on how you count headers, but yes.

Multimedia containers, while too complicated, don't really qualify for a position on that list. These containers are basically just special purpose file containers, and thus the headers of the "files" within should not contribute to the header count.

deb/rpm are also good examples of old and quite obnoxious formats. Deb is an AR archive of two GZIP compressed TAR archives (control and data) and a single file (debian-binary). TAR replaced AR for all but a few ancient tasks long ago, but for some reason, Deb uses both. A tar.gz with 3 files/folders that were not tar'd or compressed would have been much simpler. I believe RPM goes that route, but rather than TAR they use CPIO, and rather than embedding the metadata inside the archive, the RPM package has its own header.
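
For anyone who wants to see the outer AR layer of a .deb, a rough sketch of walking the common ar layout (8-byte global magic, then 60-byte member headers, data padded to even length). It ignores GNU long-name tables and other extensions; `ar_members` is a made-up name, not a library function.

```python
AR_MAGIC = b"!<arch>\n"
HEADER_LEN = 60  # name 16, mtime 12, uid 6, gid 6, mode 8, size 10, end 2


def ar_members(blob: bytes):
    """Return (name, size) for each member of an ar archive held in memory."""
    if not blob.startswith(AR_MAGIC):
        raise ValueError("not an ar archive")
    members = []
    pos = len(AR_MAGIC)
    while pos + HEADER_LEN <= len(blob):
        header = blob[pos:pos + HEADER_LEN]
        if header[58:60] != b"`\n":
            raise ValueError("bad member header")
        name = header[0:16].decode("ascii").rstrip()
        if name.endswith("/"):          # GNU ar terminates names with '/'
            name = name[:-1]
        size = int(header[48:58].decode("ascii"))
        members.append((name, size))
        pos += HEADER_LEN + size + (size & 1)  # member data is 2-byte aligned
    return members
```

Run against the bytes of a .deb, this would be expected to list debian-binary followed by the control and data tarballs, each of which then needs its own decompressor and a tar reader.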

Both RPM and DEB have added support for a bunch of compression formats, meaning that not only do the contents of a DEB/RPM package have dependencies, but each package can now end up having its own dependencies that need to be satisfied before you can even read the package in the first place. Oh, and one of the supported compression formats is now XZ, adding an extra dependency, as your version of XZ might not support the contained XZ archive at all.

Aren't MS office formats the poster child for overly complex file formats?

> rpm packages

I recall an article posted here detailing how incredibly bloated and crufty the RPM format was.

"Look at how much padding the tar format has. And tar is a good example of how "useless padding" was used to extend the format to support larger files. So this supposed "flaw" has been in tar for dozens of years, with no disastrous effects at all."

Just because it's in tar doesn't mean that the design is flawless. tar was created a long time ago, when a lot of things we are concerned with now weren't even thought of.

Deterministic, bit-reproducible archives are one thing that tar has recently struggled with[1], because the archive format was not originally designed with that in mind. With more foresight and a better archive format, this need not have been an issue at all.

[1] - https://lists.gnu.org/archive/html/help-tar/2015-05/msg00005...
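
As a rough illustration of what "deterministic" takes with tar, a sketch using Python's `tarfile`: every field that normally varies between runs (member order, mtime, owner, mode) has to be pinned by hand. The function name and the pinned values are arbitrary choices for this example, not what the linked thread settles on.

```python
import io
import tarfile


def deterministic_tar(files: dict) -> bytes:
    """Build a tar whose bytes depend only on member names and contents."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for name in sorted(files):           # fixed member order
            info = tarfile.TarInfo(name)
            info.size = len(files[name])
            info.mtime = 0                    # no build-time timestamp
            info.uid = info.gid = 0           # no builder identity
            info.uname = info.gname = ""
            info.mode = 0o644
            tar.addfile(info, io.BytesIO(files[name]))
    return buf.getvalue()


a = deterministic_tar({"b.txt": b"two", "a.txt": b"one"})
b = deterministic_tar({"a.txt": b"one", "b.txt": b"two"})
print(a == b)
```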

The name tar comes from Tape ARchive. Lots of padding makes sense when you know that tar was originally used to write files to magnetic tape, which is highly block oriented. The use of tar today as a bundling and distribution format is something of a misapplication, as it lacks features one might want of such a program.

Thanks for such an amazing rabbit-hole of a link.

I feel he has made a case for some inadequacies in Xz. Some of the claims seem exaggerated, such as (2.2) the optional integrity checking, assuming the decompressor at least logs the fact that it couldn't do the integrity checking. Some others are clearly more significant issues, such as (2.5) not checksumming the length fields and (2.6) the variable length integers being able to cause framing errors. Others still are petty, such as (2.3) too many possible filters.
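
A sketch of why (2.5) and (2.6) worry me more than the rest. This is a generic LEB128-style decoder (7 data bits per byte, high bit means "more bytes follow"), which I am assuming is close in shape to xz's multibyte integers; it is not code from any xz implementation, and the record layout is invented.

```python
def decode_varint(buf: bytes, pos: int = 0):
    """Decode one LEB128-style integer; return (value, next_position)."""
    value = 0
    shift = 0
    while True:
        byte = buf[pos]          # IndexError on truncated input, fine for a sketch
        pos += 1
        value |= (byte & 0x7F) << shift
        if not byte & 0x80:      # high bit clear: this was the last byte
            return value, pos
        shift += 7


# A length-prefixed record: length 5, then 5 bytes of payload.
record = bytes([0x05]) + b"hello"
print(decode_varint(record))     # (5, 1)

# Flip only the continuation bit of the length byte.
damaged = bytes([record[0] | 0x80]) + record[1:]
print(decode_varint(damaged))    # (13317, 2)
```

One flipped bit changes both the decoded length and how many bytes the length field itself occupies, so everything after it is read at the wrong offset. Without a checksum over the length field, the decoder has no way to notice before acting on the bad value.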

While I think he made a case, I somewhat doubt that the other formats are flawless, and the real answer would lie in a more open analysis of all of them.