Hacker News new | ask | show | jobs
by PaulHoule 651 days ago
To tease out the issues: I see the BPE algorithm as similar to Huffman coding in the way it builds the token list by creating, at any point, a token that is optimal in "compressing" the data in that it represents a commonly occuring sequence.

As BPE is going to create tokens for frequent words it is going to gain over just representing over single-byte tokens like "a" or 0xb7 however you get a loss in that "a" is getting puffed up to two bytes or more.

In an ordinary implementation of BPE I think the order of the token ids is not relevant so depending on how it was coded up I wouldn't depend on "lower token # = less frequent" but you don't need to leave it to chance and you could just sort the tokens by frequency. Then if you wanted to compress the token sequence you could use something like

https://en.wikipedia.org/wiki/Golomb_coding

or the algorithm used in UTF-8 encoding to use fewer bits for the more common tokens. This doesn't affect the neural network part but it would let you encode the token sequence in fewer bits than it would take otherwise. I'm not sure how it competes with more commonly used compression algorithms. On the other hand if you want to use the whole neural network to compress you can do way better

https://arxiv.org/abs/2306.04050

Either using the tokens directly or the LLM you have the disadvantage that the decompressor needs a huge data structure which has been trained on a lot of text, but boy do the people in that last paper get a result way better than has been reported so far.