| Exactly. But practically you have to trade one thing for another. Before BPE I bailed on a project because the sponsor insisted on using word vectors and I thought "Look, the most important words in our documents will be out-of-dictionary, and that's like playing chess down a queen, a rook and two pawns." Once BPE and similar tokenizers came out, you could say that the model has a chance when it confronts out-of-dictionary situations, which will always be important. This was critical to the success of transformers for text. On the other hand, there are many things wrong with tokenization for particular applications. If you want to handle Japanese text, you'd think a word like 日本語 "Japanese" should be tokenized as a word or as 日本 + 語 ("Japan" + "language"). A multilingual model, however, is very likely to split it at the Unicode character level or below, so you might not even get 日 + 本 + 語 ("sun" + "origin" + "language") but the underlying UTF-8 bytes, e6 + 97 + a5 + e6 + 9c + ac + e8 + aa + 9e, which is just awful. The trouble is that an English language model doesn't want to waste a limited supply of tokens on other languages, even though it should be able to handle a few foreign characters. A Japanese language model would clearly make different decisions, and a model that supports a large number of languages is going to struggle to allocate tokens between them. |
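
To make the byte-level case concrete, here is a minimal sketch in plain Python (no tokenizer library involved) that splits 日本語 first into Unicode characters and then into its UTF-8 bytes. The variable names are only illustrative, and a real byte-level BPE tokenizer may merge some of these bytes back together depending on its training data, so treat the second list as the worst case rather than what any particular model does.

```python
word = "日本語"

# Character-level split: one entry per Unicode character.
chars = list(word)

# Byte-level fallback: each of these characters takes three bytes in UTF-8.
byte_tokens = [f"{b:02x}" for b in word.encode("utf-8")]

print(" + ".join(chars))
print(" + ".join(byte_tokens))
print(len(chars), "characters vs", len(byte_tokens), "bytes")
```

Three characters turn into nine byte tokens, so in that worst case the same word costs three times as much of the context window as a character-level split and nine times as much as a single word-level token.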