I'm pretty sure it doesn't use ranking. That leaves a lot of performance on the table. Instead you would use the actual predicted token probabilities and arithmetic coding.
I supposed it used arithmetic coding with the ranking bacause they have a distribution easy to exploit: zero more likely, one a bit less and so forth. What's your guess? Unfortunately Bellard is as smart as hermetic. We are here guessing what should be a README file.
The model gives you a probability distribution over the tokens. You could use that directly with arithmetic coding, but there are ways to convert that to a distribution over e.g. the next byte instead which would improve efficiency further by removing the redundancy in alternative token encodings. ts_zip does this, and README says this works similar to ts_zip.
EDIT: Hm, or maybe ts_zip uses just the token probabilities directly. I thought it was slightly more efficient about it.
"The language model predicts the probabilities of the next token. An arithmetic coder then encodes the next token according to the probabilities."
Oh, that makes sense! So they use the probability of the next token itself. Thanks for clarifying. Also clever trick about the multiple potential tokens to represent the same text.
antirez, it's probably identical to the approach in this paper:
Li et al 2024, "Evaluating Large Language Models for Generalization and Robustness via Data Compression" (https://ar5iv.labs.arxiv.org/html//2402.00861).
There's a pretty straight line from assigning probabilities (to a sequence of tokens) to arithmetic compression as an optimal compression algorithm for that distribution.
If you are going to zip the resulting file, it may be useful to have a lot of 0s.
If you are going to send the result as is,
Huffman coding (with some escape for unusal words(?)) will be better. I think even better than the other method that forgets the probabilities and then tries to compresd it.
Just to clarify: even storing ranking, here would likely produce good results, but not as good as storing the probability, since it exploits better the ability of arithmetic coding to store this fractional intervals. But here the fundamental trick is that the LLM can compress the "next in sequence" information in a distribution that is much better to compress than the initial data itself.
This is especially true for instance when you have two or more tokens that are about equally likely, or one token that is virtually certain, which ranking would obscure.
Arithmetic coding is better than Huffman coding because it can use a fractional number of bits per code, while Huffman has to use a whole number of bits.
IIRC the only reason it wasn't always used for everything is patents. A secondary reason being that it can be slow if you design it without thinking about performance, eg if you use divisions.