Hacker News new | ask | show | jobs
by gus_massa 540 days ago
If you are going to zip the resulting file, it may be useful to have a lot of 0s.

If you are going to send the result as is, Huffman coding (with some escape for unusal words(?)) will be better. I think even better than the other method that forgets the probabilities and then tries to compresd it.

2 comments

Just to clarify: even storing ranking, here would likely produce good results, but not as good as storing the probability, since it exploits better the ability of arithmetic coding to store this fractional intervals. But here the fundamental trick is that the LLM can compress the "next in sequence" information in a distribution that is much better to compress than the initial data itself.
This is especially true for instance when you have two or more tokens that are about equally likely, or one token that is virtually certain, which ranking would obscure.
Indeed.
Arithmetic coding is better than Huffman coding because it can use a fractional number of bits per code, while Huffman has to use a whole number of bits.

IIRC the only reason it wasn't always used for everything is patents. A secondary reason being that it can be slow if you design it without thinking about performance, eg if you use divisions.

Huffman still has a performance edge for static distributions. ANS bridges some of the performance gap between arithmetic coding and huffman.
But there's no such thing as a static distribution :)