Hacker News
by pimanrules 1191 days ago
>it couldn't possibly understand how to spell "platoggle" if it's treating it just as a single, never-before-seen, opaque token

That's not how the tokenizer works. A novel word like "platoggle" is decomposed into three separate tokens, "pl", "at", and "oggle". You can see for yourself how prompts are tokenized: https://platform.openai.com/tokenizer

2 comments

Ahh, thank you very much, definitely was missing that piece!
Why don’t they also have single letters as tokens?
They do, e.g. "gvqbkpwz" is tokenised into individual characters. Actually it was a bit tricky to construct that, since I needed to find letter combinations that are very low probability in the tokeniser's training text (e.g. "gv").

Notice that it doesn't contain any vowels: almost all consonant-vowel pairs are frequent enough in the training text to be tokenised as at least a pair. E.g. "guq" is tokenised as "gu" + "q", since "gu" is common enough.

(Compare "gun" which is just tokenised as a single token "gun", as it's common enough in the training set as a word on its own, so it doesn't need to tokenise it as "gu"+"n".)

The only exceptions I found, where a consonant-vowel pair was not tokenised as a single pair, were ones like "qe", tokenised as "q" + "e", or "qo", tokenised as "q" + "o". I guess that makes sense, given these will be low-frequency pairings in the training text; compare "qu", which is tokenised as just "qu".

(Though I didn't test all consonant-vowel pairs, so there may be more).
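
For anyone who wants to see the frequency effect in isolation, here is a minimal sketch of byte-pair-encoding-style merging in plain Python. It is a toy, not OpenAI's tokeniser: the training word list, the counts, the number of merges, and the function names (`apply_merge`, `learn_merges`, `tokenize`) are all made up for illustration, and it works on characters rather than bytes. It only shows the general idea that frequent adjacent pairs get merged into one token while rare pairs stay as single characters.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace each left-to-right occurrence of `pair` with one merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged


def learn_merges(words, num_merges):
    """Learn merges by repeatedly merging the most frequent adjacent pair."""
    corpus = Counter(tuple(w) for w in words)  # symbol sequence -> count
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        # Ties are broken by comparing the pairs themselves, for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_corpus = Counter()
        for symbols, count in corpus.items():
            new_corpus[tuple(apply_merge(symbols, best))] += count
        corpus = new_corpus
    return merges


def tokenize(word, merges):
    """Start from single characters and apply the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical toy training text: "gu" and "qu" are frequent, "gv" never occurs.
training_words = ["gun"] * 6 + ["gut"] * 2 + ["quiz"] * 2 + ["quad"] * 2
merges = learn_merges(training_words, num_merges=3)

print(merges)                        # [('g', 'u'), ('gu', 'n'), ('q', 'u')]
print(tokenize("gun", merges))       # ['gun']
print(tokenize("guq", merges))       # ['gu', 'q']
print(tokenize("qe", merges))        # ['q', 'e']
print(tokenize("gvqbkpwz", merges))  # eight single-character tokens
```

With this toy training text the behaviour mirrors the examples above: "gun" ends up as one token, "guq" splits into "gu" + "q", and a string with no frequent pairs falls back to individual characters.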

My wild guess is that if it could get things done by tokenising like that all the time, they wouldn't need to also have word-like tokens.

Whether that is an inference-time performance issue, a training-time performance issue, a model-size issue, or just total nonsense, I wouldn't know.