Hacker News
by cheald 1170 days ago
The gnarly thing is that common dictionary words are usually 1 or 2 tokens (a whole word, or a stem + a suffix), while things like emojis end up as _3_ tokens.

https://platform.openai.com/tokenizer

The string "The GPT family of models process text using tokens" is 10 tokens.

Feeding that string to the "compressor" results in "GPT models: process_text(tokens)", which is...12 tokens. The OP site incorrectly estimates that this is 8 tokens, likely using a naive word boundary regex or something similar.
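To make the mismatch concrete, here is a rough sketch of two naive estimators run on both strings. The function names and both heuristics are my own illustration; I don't know what the OP site actually does. Neither heuristic reproduces the real tokenizer's counts (10 and 12, as reported above), and the second one shows how the punctuation in the "compressed" form eats up whatever the dropped words saved.

```python
import re

original = "The GPT family of models process text using tokens"
compressed = "GPT models: process_text(tokens)"

def whitespace_count(text):
    """Naive estimate: count whitespace-separated chunks."""
    return len(text.split())

def word_or_punct_count(text):
    """Slightly less naive: each alphanumeric run and each symbol counts as one piece."""
    return len(re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text))

for label, text in (("original", original), ("compressed", compressed)):
    print(label, whitespace_count(text), word_or_punct_count(text))
```

Whitespace splitting makes the compressed string look three times cheaper, while counting symbols separately puts the two strings level. Only running the actual tokenizer tells you the real cost.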

This is because punctuation marks are each their own token, and complex words or abbreviations are broken down into one token per piece found in the dictionary. The string "ABCDEFGHIJKLMNOP" (16 characters) is 8 tokens (consisting of the bigrams AB, CD, EF, etc), while the string "Counterintuitive" (also 16 characters) is only 2 tokens (likely the tokens for "counter" and "intuitive").
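A toy sketch of that dictionary effect, under loud assumptions: the vocabulary below is made up for this example, and greedy longest-match is a stand-in for the real thing (GPT tokenizers use byte-pair encoding with a far larger learned vocabulary). It only illustrates why two 16-character strings can cost very different numbers of tokens.

```python
# Hypothetical toy vocabulary, for illustration only.
TOY_VOCAB = {"Counter", "intuitive", "AB", "CD", "EF", "GH", "IJ", "KL", "MN", "OP"}

def greedy_tokenize(text, vocab=TOY_VOCAB):
    """Split text by repeatedly taking the longest vocabulary entry at the current position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking until a vocabulary entry matches.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary starts here: fall back to a single character.
            tokens.append(text[i])
            i += 1
    return tokens

for s in ("ABCDEFGHIJKLMNOP", "Counterintuitive"):
    print(len(s), "chars ->", len(greedy_tokenize(s)), "tokens")
```

With this vocabulary the first string splits into 8 two-letter pieces and the second into 2 long ones, matching the counts above. Anything the vocabulary has never seen degrades to one token per character, which is the worst case.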

Fewer characters don't mean fewer tokens, and in fact, the more esoteric the string, the more likely it is to consume an unintuitively large number of tokens.