| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by gus_massa 540 days ago
	If you are going to zip the resulting file, it may be useful to have a lot of 0s. If you are going to send the result as is, Huffman coding (with some escape for unusal words(?)) will be better. I think even better than the other method that forgets the probabilities and then tries to compresd it.

2 comments

antirez 540 days ago

Just to clarify: even storing ranking, here would likely produce good results, but not as good as storing the probability, since it exploits better the ability of arithmetic coding to store this fractional intervals. But here the fundamental trick is that the LLM can compress the "next in sequence" information in a distribution that is much better to compress than the initial data itself.

link

gliptic 540 days ago

This is especially true for instance when you have two or more tokens that are about equally likely, or one token that is virtually certain, which ranking would obscure.

link

antirez 540 days ago

Indeed.

link

astrange 539 days ago

Arithmetic coding is better than Huffman coding because it can use a fractional number of bits per code, while Huffman has to use a whole number of bits.

IIRC the only reason it wasn't always used for everything is patents. A secondary reason being that it can be slow if you design it without thinking about performance, eg if you use divisions.

link

gliptic 539 days ago

Huffman still has a performance edge for static distributions. ANS bridges some of the performance gap between arithmetic coding and huffman.

link

astrange 537 days ago

But there's no such thing as a static distribution :)

link