Hacker News new | ask | show | jobs
by jamesk_au 1108 days ago
Can anyone comment on how the limits on GPT-X’s token space translate to limits on its vocabulary (with corresponding limits on understanding input and generating output)?

For example, is GPT-4’s list of ~100k tokens sufficient to understand and generate every non-obsolete word in the English language (per, say, a standard dictionary)? Or even every word in the training data?

If not, do we have examples of ordinary words that it is impossible for GPT-4 ever to understand or generate? What happens when it encounters those words and is unable to tokenize them; are they simply ignored (eg omitted from the input vector, or set to 0 or some sort of null token)?

2 comments

IIRC from poking around in the LLaMA internals (I assume ChatGPT is the same since it’s the obvious way to handle this): the token list has a complete set of tokens of length 1. This means that in the degenerate case where the tokenizer can’t compose the text out of any other tokens it’ll still be processable, just as a collection of single-character tokens that the language model presumably has vaguer associations for. (Which I imagine doesn’t actually affect things significantly; if you added more tokens for less-frequently-seen strings, it still wouldn’t have much of an idea what to do with them.)
You are almost correct, though it doesn't happen at character level, it happens at byte level. Most characters are in LLaMA tokenizer's vocabulary, but all characters aren't. So if you use a character that was uncommon in the training material, it will fall back to byte-level tokens. In most cases 1 character can be represented as 1 byte (and thus 1 byte-level token). However, some characters require more than 1 byte in UTF-8; those characters might end up with as much as 4 tokens.
> However, some characters require more than 1 byte in UTF-8; those characters might end up with as much as 4 tokens.

This would seem to raise an interesting "prompt golf" challenge: find a reasonable-sounding prompt that causes the language model to generate invalid UTF-8 in its output.

My current understanding is that the lack of a token for a specific word does nothing to prevent that word from being "understood" or produced in output - GPT-4 is very capable of consuming and producing text in languages such as Spanish despite most Spanish words not corresponding to a single token.
For Russian text, it degrades to, basically, 1 character = 1 token, due to the tokenization issues discussed in the article, yet it produces absolutely coherent text, almost same as in English. In my tests, its Russian output is worse than English output, though, something like 80% quality I'd say. I'm not an LLM expert but I have a theory that, being mostly trained on English text, its thought processes actually happen in English (the part of the model which was trained on English text) and for Russian, it's able to map English to Russian and back thanks to its language translation ability, because I've noticed sometimes it produces slightly awkward sentences whose word choice makes sense in English (calques?) and not as much in Russian.
I think the same. LLMs are actually sort of "multi-linguas", able to transform source of any language to internal representation and then do output in some other language, thanks to so many layers of neurons inside it.