Hacker News new | ask | show | jobs
by blackle 774 days ago
There is a way to do this same compression by utilizing the raw probability distributions that the LLM produces as output. Fabrice Bellard has some experiments with doing this with transformers: https://bellard.org/nncp/

The idea is that if you can produce an accurate probably distribution over the next bit/byte/token, then you can compress things with an entropy compressor (huffman encoding, range encoding, asymmetric numeral systems, etc). This comment is too small of a space to explain fully how they work, but it suffices to say that pretty much every good compression algorithm models probability distributions in some way.

2 comments

Exactly this. The way to do this is to use the LLM as the statistical model inside an arithmetic coder. You use its next-token probabilities to encode the next token. The KL divergence between the LLM's probabilities, and the empirical probabilities in the corpus that you actually compress, is the compression inefficiency.
> The idea is that if you can produce an accurate probably distribution over the next bit/byte/token...

But how can you get credible probability distributions from the LLMs? My understanding is that the outputs specifically can't be interpreted as a probability distribution, even though superficially they resemble a PMF, due to the way the softmax function tends to predict close to 100% for the predicted token. You can still get an ordered list of most probable tokens (which I think beam search exploits), but they specifically aren't good representations of the output probability distribution since they don't model the variance well.

My understanding is that minimizing perplexity (what LLMs are generally optimized for) is equivalent to finding a good probably distribution over the next token.