Hacker News
by Silverback_VII 1146 days ago
And what about phonetics? Wouldn't it be easier for the system if it doesn't have to figure it out by itself?

Exactly. But in practice you have to trade one thing for another.

Before BPE I bailed on a project because the sponsor insisted on using word vectors and I thought "Look, the most important words in our documents will be out-of-dictionary and that's like playing chess down a queen, a rook and two pawns."

Once BPE and similar tokenizers came out, you could say that the model has a chance when it confronts out-of-dictionary situations, which will always be important. This was critical to the success of transformers for text.
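
To make the out-of-dictionary point concrete, here is a toy sketch in Python. It is not any real tokenizer's implementation: the training words, the merge count and the helper names (`learn_bpe`, `encode`, `word_level_lookup`) are all made up for illustration, and real BPE tokenizers add details such as word-boundary markers and byte fallback.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with the concatenated symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def learn_bpe(words, num_merges):
    """Learn a list of merges by repeatedly joining the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[tuple(merge_pair(list(symbols), best))] += freq
        corpus = new_corpus
    return merges

def encode(word, merges):
    """Split a word into characters, then apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return symbols

def word_level_lookup(word, vocab):
    """Word-vector style lookup: anything unseen collapses to <unk>."""
    return word if word in vocab else "<unk>"

# Hypothetical training data; "tokenizers" never appears in it.
train_words = ["token", "tokens", "tokenize", "tokenized", "organize", "organizers"]
merges = learn_bpe(train_words, num_merges=10)

unseen = "tokenizers"
word_result = word_level_lookup(unseen, set(train_words))
subword_result = encode(unseen, merges)
print(word_result)      # the word-level model has lost the word entirely
print(subword_result)   # the subword model still sees pieces it was trained on
```

The word-level lookup throws the unseen word away, while the subword pieces always join back into the original string, so the model at least gets something to work with.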

On the other hand, there are many things wrong with tokenization for particular applications. If you want to handle Japanese text you'd think a word like 日本語 "Japanese" should be tokenized as a single word or as 日本 + 語 ("Japan" + "language").

A multilingual model, however, is very likely to tokenize those at the Unicode character level or below, so you might not even get 日 + 本 + 語 ("sun" + "origin" + "language") but the underlying UTF-8 bytes, like e6 + 97 + a5 + e6 + 9c + ac + e8 + aa + 9e, which is just awful.
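
The byte sequence is easy to check with nothing but the Python standard library. This only shows the raw characters and bytes; how a particular tokenizer groups them is a separate question that this snippet does not answer.

```python
# Characters, code points and UTF-8 bytes for the example word.
word = "日本語"

chars = list(word)
code_points = [f"U+{ord(c):04X}" for c in word]
utf8_bytes = [f"{b:02x}" for b in word.encode("utf-8")]

print(chars)
print(code_points)
print(" + ".join(utf8_bytes))
```

Three characters become nine bytes, so a byte-level fallback spends nine tokens on what a Japanese-aware vocabulary could cover in one or two.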

The trouble is that an English language model doesn't want to waste a limited supply of tokens on other languages, even though it should be able to handle a few foreign characters. A Japanese language model would clearly make different decisions, and a model that supports a large number of languages is going to struggle to allocate tokens between them.

Why is the supply of tokens limited? Are they currently represented as 16-bit unsigned ints (I hear a vocab size of about 50k for GPT-3)? If so, is there a performance penalty for going to u32 beyond the extra memory?

Tokens are effectively one-hot encoded, so the embedding matrix has one row per token and grows with the size of the vocabulary. This has an impact on all the subsequent layers and on the final number of parameters. You want tokens that are as information-dense as possible, while still being able to represent weird or unseen input, but discrete enough to allow differentiation of concepts.
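
A rough back-of-the-envelope sketch of why the integer width of a token ID is not the real constraint. The figures are assumptions for illustration: a vocabulary of 50,257 and a model width of 12,288, which are the commonly cited numbers for the largest GPT-3 model. The function name `embedding_params` is made up here.

```python
# Assumed figures, not taken from the thread: commonly cited GPT-3 values.
VOCAB_SIZE = 50_257
D_MODEL = 12_288
UINT16_MAX = 2**16 - 1

# The IDs themselves fit comfortably in 16 bits.
fits_in_u16 = VOCAB_SIZE <= UINT16_MAX

def embedding_params(vocab_size, d_model):
    """Parameters in an embedding table with one d_model-wide row per token."""
    return vocab_size * d_model

base = embedding_params(VOCAB_SIZE, D_MODEL)
doubled = embedding_params(2 * VOCAB_SIZE, D_MODEL)

print(fits_in_u16)
print(f"{base:,} embedding parameters")
print(f"{doubled:,} embedding parameters with twice the vocabulary")
```

Under these assumptions the cost of a bigger vocabulary shows up in the embedding table (and in the output layer that scores every token), not in whether an ID is stored as u16 or u32.
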
I would actually be less worried about a sequence of raw bytes than about the tokens generated by BPE. If "Japan" is 01 and "language" is 02, then "Japanese" will probably be 03, which has no connection at all to 01 or 02. But a raw, verbose encoding slows down convergence at the beginning. (Well, at least in English it does.)