
They do, e.g. "gvqbkpwz" is tokenised into individual characters. It was actually a bit tricky to construct that, since I needed to find letter combinations that have very low probability in the tokeniser's training text (e.g. "gv").

Notice it doesn't contain any vowels, since almost all consonant-vowel pairs are frequent enough in the training text to be tokenised at least as a pair. E.g. "guq" is tokenised as "gu" + "q", since "gu" is common enough.

(Compare "gun", which is tokenised as the single token "gun": it's common enough in the training set as a word on its own, so the tokeniser doesn't need to split it into "gu" + "n".)
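To make the frequency argument concrete, here's a toy sketch of how a BPE-style tokeniser learns merges. This assumes the tokeniser in question is BPE-like, which I haven't confirmed; the corpus, the merge count, and the names `train_bpe`, `apply_merge` and `tokenise` are all made up for illustration and are not any real library's API.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(words, num_merges):
    """Learn up to `num_merges` merges by repeatedly merging the most frequent adjacent pair."""
    corpus = [list(w) for w in words]  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols in corpus:
            pairs.update(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]
        merges.append(best)
        corpus = [apply_merge(s, best) for s in corpus]
    return merges


def tokenise(word, merges):
    """Tokenise `word` by applying the learned merges in the order they were learned."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical toy corpus: "gu" is very common, and "gun" is common as a whole word.
words = ["gun"] * 6 + ["gum"] * 3 + ["gut"] * 2
merges = train_bpe(words, num_merges=2)

print(merges)                        # [('g', 'u'), ('gu', 'n')]
print(tokenise("gun", merges))       # ['gun']
print(tokenise("guq", merges))       # ['gu', 'q']
print(tokenise("gvqbkpwz", merges))  # ['g', 'v', 'q', 'b', 'k', 'p', 'w', 'z']
```

A real tokeniser learns tens of thousands of merges from a huge corpus, so only genuinely rare pairs like "gv" are left unmerged.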

The only exceptions I found to consonant-vowel pairs being tokenised as pairs were ones like "qe", which is tokenised as "q" + "e", or "qo", tokenised as "q" + "o". I guess that makes sense, given these will be low-frequency pairings in the training text; compare "qu", which is tokenised as the single token "qu".

(Though I didn't test all consonant-vowel pairs, so there may be more.)
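Checking every pair would be easy to script. A rough sketch, assuming you have some function that maps a string to a list of tokens: `split_pairs` is a hypothetical helper, and `toy_tokenise` is a stand-in written purely so the example runs. Its output says nothing about what a real tokeniser does.

```python
from itertools import product
from string import ascii_lowercase

VOWELS = "aeiou"
CONSONANTS = [c for c in ascii_lowercase if c not in VOWELS]  # treats "y" as a consonant


def split_pairs(tokenise):
    """Return the consonant-vowel pairs that `tokenise` breaks into more than one token."""
    return [c + v for c, v in product(CONSONANTS, VOWELS) if len(tokenise(c + v)) > 1]


def toy_tokenise(text):
    """Stand-in tokeniser (an assumption for the demo): splits q-pairs other than "qu"."""
    if text.startswith("q") and text != "qu":
        return list(text)
    return [text]


print(split_pairs(toy_tokenise))  # ['qa', 'qe', 'qi', 'qo']
```

To run it for real, pass in the actual tokeniser's encode function in place of `toy_tokenise`.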