| HN Mirror

	by gliptic 577 days ago
	I'm pretty sure it doesn't use ranking. That leaves a lot of performance on the table. Instead you would use the actual predicted token probabilities and arithmetic coding.

1 comments

antirez 577 days ago

I supposed it used arithmetic coding on the ranks, because they have a distribution that is easy to exploit: zero is most likely, one a bit less, and so forth. What's your guess? Unfortunately Bellard is as hermetic as he is smart. We are here guessing at what should be in a README file.

gliptic 577 days ago

The model gives you a probability distribution over the tokens. You could use that directly with arithmetic coding, but there are ways to convert that to a distribution over e.g. the next byte instead, which would improve efficiency further by removing the redundancy in alternative token encodings. ts_zip does this, and the README says this works similarly to ts_zip.
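
To make the "alternative token encodings" point concrete, here is a minimal sketch that enumerates every way a string can be split into tokens. The three-entry vocabulary is a made-up toy, not any real tokenizer's, and `tokenizations` is a hypothetical helper. A token-level coder reserves code space for each of these sequences even though they all decode to the same text.

```python
def tokenizations(text, vocab):
    """Return every token sequence from `vocab` that concatenates to `text`.

    Assumes all tokens in `vocab` are non-empty strings.
    """
    if not text:
        return [[]]  # exactly one way to tokenize the empty string
    results = []
    for tok in vocab:
        if text.startswith(tok):
            # Fix `tok` as the first token, then tokenize the remainder.
            for rest in tokenizations(text[len(tok):], vocab):
                results.append([tok] + rest)
    return results


# Toy vocabulary (an assumption for illustration only).
toy_vocab = ["a", "b", "ab"]
for seq in tokenizations("abab", toy_vocab):
    print(seq)
```

Coding a distribution over bytes (or over the canonical tokenization only) avoids spending probability mass on the duplicates.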

EDIT: Hm, or maybe ts_zip uses just the token probabilities directly. I thought it was slightly more efficient about it.

"The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities."
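
The quoted scheme can be sketched in a few lines. This is a toy, not ts_zip's implementation: `toy_model` is a hypothetical stand-in for the language model, exact `Fraction` arithmetic replaces the fixed-precision integer arithmetic a real coder would use, and the message is assumed to end with an `<eos>` token.

```python
from fractions import Fraction
from math import ceil


def toy_model(context):
    """Hypothetical stand-in for an LLM: P(next token | context)."""
    if context and context[-1] == "a":
        return {"a": Fraction(1, 8), "b": Fraction(3, 4), "<eos>": Fraction(1, 8)}
    return {"a": Fraction(3, 4), "b": Fraction(1, 8), "<eos>": Fraction(1, 8)}


def encode(tokens):
    """Narrow [low, low + width) once per token; return (k, nbits)."""
    low, width = Fraction(0), Fraction(1)
    for i, tok in enumerate(tokens):
        cum = Fraction(0)
        for sym, p in toy_model(tokens[:i]).items():
            if sym == tok:
                low += width * cum  # skip the mass of earlier symbols
                width *= p          # shrink by this token's probability
                break
            cum += p
        else:
            raise ValueError(f"token {tok!r} not in model vocabulary")
    # Emit the shortest binary fraction k / 2**nbits inside the interval.
    nbits = 0
    while True:
        k = ceil(low * 2**nbits)
        if Fraction(k, 2**nbits) < low + width:
            return k, nbits
        nbits += 1


def decode(k, nbits):
    """Replay the same model and pick the sub-interval containing the code."""
    x = Fraction(k, 2**nbits)
    low, width = Fraction(0), Fraction(1)
    out = []
    while not out or out[-1] != "<eos>":
        cum = Fraction(0)
        for sym, p in toy_model(out).items():
            if x < low + width * (cum + p):
                low += width * cum
                width *= p
                out.append(sym)
                break
            cum += p
    return out


message = ["a", "b", "a", "b", "<eos>"]
k, nbits = encode(message)
print(nbits, decode(k, nbits) == message)
```

The decoder needs nothing but the bits and the same model, which is why the model itself has to be available (and deterministic) on both sides.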

antirez 577 days ago

Oh, that makes sense! So they use the probability of the next token itself. Thanks for clarifying. Also clever trick about the multiple potential tokens to represent the same text.

bravura 576 days ago

antirez, it's probably identical to the approach in this paper: Li et al. 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression" (https://ar5iv.labs.arxiv.org/html//2402.00861).

There's a pretty straight line from assigning probabilities (to a sequence of tokens) to arithmetic compression as an optimal compression algorithm for that distribution.
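
The "straight line" is that a token the model assigned probability p costs about -log2(p) bits under an ideal arithmetic coder, so the compressed size is roughly the sum of those costs. A minimal sketch, where `ideal_bits` is a hypothetical helper and the probabilities are made-up numbers rather than output from any real model:

```python
from math import log2


def ideal_bits(token_probs):
    """Ideal code length in bits, given the probability the model
    assigned to each token that actually occurred."""
    return sum(-log2(p) for p in token_probs)


# Made-up per-token probabilities for a short sequence.
probs = [0.9, 0.5, 0.99, 0.01]
print(f"{ideal_bits(probs):.2f} bits for {len(probs)} tokens")
```

A single badly predicted token (the 0.01 above) dominates the total, which is why better prediction translates directly into better compression.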

gus_massa 577 days ago

If you are going to zip the resulting file, it may be useful to have a lot of 0s.

If you are going to send the result as is, Huffman coding (with some escape for unusual words(?)) will be better. I think it is even better than the other method, which forgets the probabilities and then tries to compress the result.

antirez 577 days ago

Just to clarify: even storing the ranking would likely produce good results here, but not as good as using the probabilities, since those better exploit the ability of arithmetic coding to represent these fractional intervals. But the fundamental trick here is that the LLM turns the "next in sequence" information into a distribution that is much easier to compress than the initial data itself.

gliptic 577 days ago

This is especially true for instance when you have two or more tokens that are about equally likely, or one token that is virtually certain, which ranking would obscure.
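
A rough numeric illustration of what ranking hides, under made-up assumptions: half of the steps the model is nearly certain, half of the steps it sees two equally likely tokens, and the rank-based coder uses one static distribution over ranks (the best a context-free rank coder can do).

```python
from math import log2


def entropy(dist):
    """Shannon entropy in bits of a probability distribution."""
    return -sum(p * log2(p) for p in dist if p > 0)


# (weight, distribution sorted by descending probability, so index == rank)
# Both the weights and the distributions are illustrative assumptions.
steps = [
    (0.5, [0.99, 0.01]),  # one token virtually certain
    (0.5, [0.5, 0.5]),    # two tokens about equally likely
]

# Coding with the model's actual probabilities at each step.
prob_bits = sum(w * entropy(d) for w, d in steps)

# Coding only the rank: the coder sees the mixture of both situations.
n_ranks = max(len(d) for _, d in steps)
rank_marginal = [
    sum(w * d[r] for w, d in steps if r < len(d)) for r in range(n_ranks)
]
rank_bits = entropy(rank_marginal)

print(f"probabilities: {prob_bits:.3f} bits/token")
print(f"ranks only:    {rank_bits:.3f} bits/token")
```

Because entropy is concave, the rank-only figure can never be lower than the per-step figure; the gap is the information about confidence that the rank throws away.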

antirez 577 days ago

Indeed.

astrange 576 days ago

Arithmetic coding is better than Huffman coding because it can use a fractional number of bits per code, while Huffman has to use a whole number of bits.
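
The whole-bit limitation is easy to see on a skewed distribution. The sketch below builds Huffman code lengths with the standard greedy merge and compares the average length to the entropy, which is what an arithmetic coder approaches. The three-symbol distribution is an arbitrary example.

```python
import heapq
from math import log2


def huffman_lengths(probs):
    """Code length in bits of each symbol under a Huffman code."""
    # Heap entries: (probability, tie-breaker, symbols in this subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (p1 + p2, tie, s1 + s2))
        tie += 1
    return lengths


probs = [0.9, 0.05, 0.05]
lengths = huffman_lengths(probs)
huffman_avg = sum(p * n for p, n in zip(probs, lengths))
entropy = -sum(p * log2(p) for p in probs)

print(f"Huffman: {huffman_avg:.3f} bits/symbol")
print(f"Entropy: {entropy:.3f} bits/symbol")
```

Huffman cannot spend less than one bit on the 0.9 symbol, while its ideal cost is about 0.15 bits; that gap grows as predictions become more confident, which is exactly the regime a good language model operates in.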

IIRC the only reason it wasn't always used for everything is patents. A secondary reason is that it can be slow if you design it without thinking about performance, e.g. if you use divisions.

gliptic 576 days ago

Huffman still has a performance edge for static distributions. ANS bridges some of the performance gap between arithmetic coding and Huffman.

astrange 573 days ago

But there's no such thing as a static distribution :)
