Hacker News new | ask | show | jobs
by klauspost 1507 days ago
Looks interesting, but my main objections to general adoption the same as bzip2, lzma and context modelling based codecs - decompression speed.

Compressing logs for instance, decompression speed of 23MB/s per core, is simply too slow when you need to grep through gigabytes of data. Same for data analysis, you don't want your input speed to be this limited when analysing gigabytes of data.

I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manner to take over a name like this.

4 comments

> I am not sure how I feel about you "stealing" the bzip name. While the author of bzip2 doesn't seem to plan to release a follow-up, I feel it is bad manner to take over a name like this.

I think it boils down to the feelings of the author (of the previous format).

I don't think PKWARE feels bad because ZSTD is a homage to ZIP. Similarly if someone created a follow-up file format to something I've designed, I'd just want credit or a link to my version as a homage and pointer for history continuity, nothing else.

Open source software is designed to be mangled, modified, shared and leapfrogged. If a completely different implementation advertises itself as a newer iteration of a format because it's built on the same theory, I think it's ethical if the developer is not intending to capitalize name. Either case, if the original developer returns to the game, it can create a BZIP4 and points to the diversion as, "hey, somebody liked BZIP2 too much and created this, give him/her a kudos", and continue.

Pkware might not have been so forgiving if someone had released ZIP2. Incrementing the version number like that is only an acceptable thing in relatively unusual circumstances, but does happen sometimes; and still, I would really hesitate to say that it’s a good idea for a third party to call itself bzip3.

The original author replaced their own bzip release with bzip2 to avoid a patent issue with arithmetic coding, so this is the first time a third party has done so: https://web.archive.org/web/19980704181204/http://www.muraro...

So if the release of bzip3 is approved by the current maintainer, then I guess it’s fine, but otherwise it makes me uncomfortable to consider using under this name.

> Open source software is designed to be mangled, modified, shared and leapfrogged.

I agree in spirit, but I can also see why someone might want their source to be free and mangleable, but still care about trademarks.

(Just imagine Linus Torvalds getting lots of emails with support requests for a hypothetical Linux2 operating system that I wrote, and that he has no relation with. That could become pretty annoying; even if he doesn't mind me taking his source code.)

A trademark changes the situation in every aspect. If your code is getting big, famous, and needs its name and likeness (e.g. Firefox), you get a trademark, and your derivatives shall not use the name. It's simple and clear.

When I quickly looked, bzip2 didn't have a trademark, and I assumed the developer don't care.

It's same for me. If I care, I'd trademark it, and prevent people from using it. If I'm giving the name away, I'd not trademark it. It's plain and simple.

So, a trademark is a 'legal' concept, and comes with certain rights and obligations. Eg you need to actively defend your trademark in order to keep it.

I was perhaps a bit fuzzy: I used the word trademark, but I meant the moral equivalent that only confers moral rights and obligations. Ie don't be a jerk and name your stuff in a confusing way, even if there's no legal obligation.

Even this fuzzier 'moral' trademark only applies, if the original owner wants it to apply. I have no special insight into the bzip2 situation. So I have no clue whether the authors of bzip or bzip2 cared. I gladly defer to your research on this question.

My point was meant purely abstract about the possibility of wanting your source to be free and open and fondled by others, but still have a (morally) protected name.

> I don't think PKWARE feels bad because ZSTD is a homage to ZIP.

Is zstd actually an homage to zip?

I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.

The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format. As do two very old, obsolete Unix compression programs: "pack"[3], which is uses a ".z" suffix for compressed files, and "compress"[4], which uses a ".Z" suffix for compressed files.

In those algorithm names, the L and Z stand for Lempel and Ziv, respectively. But interestingly, the Unix "pack" program uses a ".z" suffix even though its algorithm is just Huffman (not one of the Lempel-Ziv family of algorithms), so the letter Z somehow came to signify data compression more generally.

Rough timeline of letter Z in data compression:

1977: LZ77

1978: LZ78

1982 or earlier: Unix "pack" (.z)

1984: LZW

1985: Unix "compress" (.Z)

1989: PKZIP

---

[1] https://en.wikipedia.org/wiki/LZ77_and_LZ78

[2] https://en.wikipedia.org/wiki/Lempel-Ziv-Welch

[3] https://en.wikipedia.org/wiki/Pack_(compression)

[4] https://en.wikipedia.org/wiki/Compress

> Is zstd actually an homage to zip?

I think so. Or more generally, DEFLATE family of algorithms used by zlib and ZIP.

> I'm not saying that it definitely isn't, but the only connection I know of myself is that they both begin with the letter Z, and the letter Z has a long association with data compression that goes back before the zip format / pkzip program.

> The LZ77, LZ78[1], and LZW[2] algorithms all predate the zip format.

I'm well aware of the Ziv, Lempel & Welch's work, since I've developed a compression algorithm [*] and did extensive research on the family before, however the usage of extension is a rather new information for me.

On the other side, official repository of zstd [0] lists zlib and other libraries which use DEFLATE in one way other, pointing back to Ziv's ZIP at the same time.

At the end of the day, it might be tipping its hat directly to "ZIP" per se, but to general direction of DEFLATE which is used by ZIP format too.

[*]: The algorithm I developed was working on syllables rather than bytes. It had a deterministic and fast hyphenation engine for the language, and used embedded dictionary to minimize bit-flip damage during transit. I have paper published from it, and still planning to re-implement and open it, since the PoC was beyond bad from a code quality perspective.

Right, there's definitely a connection in the algorithms. They're all derivatives of the same family of stuff.

I could've been clearer, but when I said the only connection I saw between them was the Z, I meant the only connection between their names. In other words, while zstd surely has zip among its main influences, I'm not sure that the Z in "zstd" is there to convey "zip-like". It could be, or maybe it's just there because Z is generally associated with data compression.

That's funny, 23 MiB/s is exactly what I get for reading systemd logs (on an NVME SSD). Is it supposed to be otherwise?

    $ sudo journalctl -r | pv -a > /dev/null
    [22.8MiB/s]
That appears to be systemd being slow.

  $ dd if=/dev/urandom of=test bs=1G count=1 iflag=fullblock
  $ gzip -k test
  $ zcat test.gz | pv -a >/dev/null
  [ 228MiB/s]

  $ sudo journalctl -r | pv -a >/dev/null
  [13.1MiB/s]
UPDATE: Gzip with more real-world data[1]:

  $ gzip -k adventures-of-huckleberry-finn.txt
  $ zcat adventures-of-huckleberry-finn.txt.gz | pv -a >/dev/null
  [ 151MiB/s]
[1]: <https://gutenberg.org/files/76/76-0.txt>
By "compressing" random data you are bypassing gzip, since it will just store your data as uncompressed blocks, making "decompression" a memcopy.

With real data, deflate maxes out somewhere around there either way, but that is a bit coincidental.

With modern CPUs getting increasingly smaller IPC improvements this will likely be pretty much the max decompression speed we can expect from gzip going forward.

It actually made the file bigger (1.1G from 1.0G) :)

I was getting the same numbers with text data I have scattered on my disk, but those were small, so I decided to generate a bigger file. But, yes, I agree a more robust benchmark would use a Mark Twain novel e.g.

I think the lack of speed here is more that it has to serialize the data from disk into a readable format. I assume using the `--grep=` option is faster than piping it through grep because of this

    --grep=
That's exactly what I needed to know! I'm glad I asked the stupid question. Thank you!
With the other parts of the codec, I doubt it's possible for files compressed with this, but one of the strengths of BWT-based compression is that there has been a lot of research on search operations directly on compressed data.
If you’re using xz, pixz can do multithreaded decompression. It’s still xz/lzma, so still expensive to decompress, but at least that allows you to throw as many cores as you want at it.