Hacker News new | ask | show | jobs
by antirez 541 days ago
Just to clarify: even storing ranking, here would likely produce good results, but not as good as storing the probability, since it exploits better the ability of arithmetic coding to store this fractional intervals. But here the fundamental trick is that the LLM can compress the "next in sequence" information in a distribution that is much better to compress than the initial data itself.
1 comments

This is especially true for instance when you have two or more tokens that are about equally likely, or one token that is virtually certain, which ranking would obscure.
Indeed.