by lopuhin
2402 days ago
The paper raises a great point about tokenization affecting perplexity: we can't compare perplexities across different tokenizers, say BPE vs word tokenization, even after re-normalizing to take token counts into account. This example nails it: https://twitter.com/Smerity/status/1192252147598909441
|
The main disadvantage of word-level models is the large vocabulary size. However, the tweet completely ignores their advantage: the sequence becomes shorter, so the model only has to look a few tokens back to find the reference to "Bob" and "Alice".
The same model writes more sensible sentences at word level than at character level. There's a tradeoff between a larger vocabulary and modelling longer dependencies. A model that encodes a text document more efficiently is better; tokenization is just one part of the modelling. You just need to take care of the "per word" part of "perplexity per word", i.e. normalize by the same number of words rather than by each model's own token count, and then you can compare their performance directly.
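A minimal sketch of that normalization, assuming two hypothetical models that assign the same total probability to one 3-word sentence. The probabilities, the 5-token subword split, and the `perplexity` helper are made up for illustration and do not come from the paper or the tweet.

```python
import math

def perplexity(token_log_probs, n_units):
    """exp of the negative total log-likelihood divided by n_units."""
    return math.exp(-sum(token_log_probs) / n_units)

# Hypothetical natural-log probabilities for one 3-word sentence.
word_lp = [math.log(p) for p in (0.2, 0.5, 0.1)]
# The same sentence split into 5 tokens by a hypothetical subword tokenizer.
subword_lp = [math.log(p) for p in (0.2, 0.5, 1.0, 0.25, 0.4)]
n_words = 3

# Normalizing by each model's own token count: numbers are not comparable.
per_token_word = perplexity(word_lp, len(word_lp))
per_token_sub = perplexity(subword_lp, len(subword_lp))

# Normalizing both by the number of words: numbers are comparable.
per_word_word = perplexity(word_lp, n_words)
per_word_sub = perplexity(subword_lp, n_words)

print(f"per token: word={per_token_word:.3f} subword={per_token_sub:.3f}")
print(f"per word:  word={per_word_word:.3f} subword={per_word_sub:.3f}")
```

Both models assign a total probability of 0.01 to the sentence, so the per-word figures agree, while the per-token figure for the subword model looks better only because it is averaged over more tokens.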
The author is wrong that entropy collapses once the "A" of "Alice" is given. Entropy only collapses if the model has really "understood" the context and modelled that "Bob" and "Alice" are the only options here. It won't collapse for a SentencePiece-based bigram model, for example.
In his example, it is not clear that the wordpiece model is at an advantage. Suppose both models "understand" that there are two options, "Bob" and "Alice". Then the word-level model only has to predict one token, which can be either of the names: probability 0.5, i.e. a perplexity of 2 for that word. The sentence-piece model also has to choose between two tokens, "B" and "A"; the second token won't add to the perplexity since it will already be known. Again the probability is 0.5 and the per-word perplexity is 2.
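The arithmetic can be written out as follows, reading the 0.5 above as the probability each model assigns to the correct name. The even split between the two names, the "A" + "lice" tokenization, and the symbols are assumptions made for illustration.

```latex
\begin{align*}
P_{\text{word}}(\text{Alice}) &= 0.5 \\
P_{\text{subword}}(\text{Alice}) &= P(\text{A}) \cdot P(\text{lice} \mid \text{A}) = 0.5 \cdot 1 = 0.5 \\
\mathrm{PPL}_{\text{per word}} &= P(\text{Alice})^{-1/N_{\text{words}}} = 0.5^{-1} = 2
  \quad \text{(both models, } N_{\text{words}} = 1\text{)} \\
\mathrm{PPL}^{\text{subword}}_{\text{per token}} &= 0.5^{-1/2} = \sqrt{2} \approx 1.41
\end{align*}
```

Under these assumptions the two models tie once both are normalized per word; the subword model only looks better when its likelihood is averaged over its own two tokens.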