
I don't see his point. Doesn't renormalizing by token count essentially eliminate the effect of tokenization? The perplexity we then get is representative of how well a model compresses the test document. Isn't that the whole point? A better model compresses the document better, so why does it matter whether you model each character, each word, bigrams, or even the bits directly?
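To make the renormalization concrete, here is a minimal sketch. The function name and the probabilities are made up for illustration; the only point is that dividing the total negative log-likelihood by the number of words, rather than the number of tokens, gives a figure that does not depend on how the text was tokenized.

```python
import math

def perplexity(token_log_probs, normalizer):
    """exp(total negative log-likelihood / normalizer).

    token_log_probs: natural-log probabilities assigned to each token.
    normalizer: number of units to average over (tokens, words, ...).
    """
    total_nll = -sum(token_log_probs)  # in nats
    return math.exp(total_nll / normalizer)

# Hypothetical scores for the same 2-word text under two tokenizations.
# Both models assign the whole text a probability of 0.125.
word_level = [math.log(0.25), math.log(0.5)]                 # 2 tokens
subword_level = [math.log(0.5), math.log(1.0),
                 math.log(0.25), math.log(1.0)]              # 4 tokens

num_words = 2

# Per-token perplexity differs just because the token counts differ.
ppl_token_word = perplexity(word_level, len(word_level))
ppl_token_subword = perplexity(subword_level, len(subword_level))

# Per-word perplexity is the same, since the total likelihood is the same.
ppl_word_word = perplexity(word_level, num_words)
ppl_word_subword = perplexity(subword_level, num_words)

print(ppl_token_word, ppl_token_subword)
print(ppl_word_word, ppl_word_subword)
```

The per-token numbers are not comparable across tokenizations, while the per-word numbers agree whenever the two models assign the same total probability to the text.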

The main disadvantage of word-level models is the large vocabulary size. However, the tweet completely ignores the advantage: the sequence becomes shorter, so the model has to look only a few tokens back to find the reference to "Bob" and "Alice".

The same model at word level writes more sensible sentences than at character level. There's a tradeoff between a larger vocabulary and modelling longer dependencies. A model which can encode a text document more effectively is better; tokenization is just a part of the modelling. You just need to take care of the "per word" part of "perplexity per word", i.e. normalize by the number of words rather than the number of tokens, and you can directly compare their performance.

The author is wrong that entropy collapses once the "A" of "Alice" is given. Entropy will only collapse if the model has really "understood" the context and modelled that "Bob" and "Alice" are the only options here. The entropy won't collapse for a sentencepiece-based bigram model, for example.

In his example, it is not clear that the wordpiece model is at an advantage. Suppose both models "understand" that there are two options, "Bob" and "Alice". Then the word-level model only has to predict one token, which can be either of the names. Probability = 0.5, so perplexity per word = 2. The sentencepiece model also has to choose between two tokens, "B" and "A"; the second token won't add to the perplexity since it will be known. Probability = 0.5 again, so perplexity per word = 2.
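Here is the Bob/Alice case as a small sketch. The split of "Alice" into "A" + "lice" and the exact probabilities are assumptions for illustration. Each model gives the name a total probability of 0.5, which corresponds to a per-word perplexity of 1/0.5 = 2 for both; only the per-token figure differs.

```python
import math

def perplexity(token_log_probs, normalizer):
    # exp of the average negative log-likelihood per unit
    return math.exp(-sum(token_log_probs) / normalizer)

# Word-level model: one token, "Alice" vs "Bob" equally likely.
word_tokens = [math.log(0.5)]

# Subword model (hypothetical split "A" + "lice"): the first piece is a
# coin flip between "A" and "B", the second piece is then certain.
subword_tokens = [math.log(0.5), math.log(1.0)]

num_words = 1  # both models are predicting the same single word

ppl_word = perplexity(word_tokens, num_words)        # about 2.0
ppl_subword = perplexity(subword_tokens, num_words)  # about 2.0

# Normalizing by token count instead makes the subword model look better
# (about 1.41) without it actually knowing anything more.
ppl_subword_per_token = perplexity(subword_tokens, len(subword_tokens))

print(ppl_word, ppl_subword, ppl_subword_per_token)
```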