by ddod
1170 days ago
The original tweet author eventually realized this didn't decrease tokens. In most cases it actually increases them compared to just asking GPT to summarize while retaining all functional data. If a word is 1 token, a related emoji will also cost at least 1 token, so swapping one for the other saves nothing.
https://platform.openai.com/tokenizer
The string "The GPT family of models process text using tokens" is 10 tokens.
Feeding that string to the "compressor" results in "GPT models: process_text(tokens)", which is...12 tokens. The OP site incorrectly estimates that this is 8 tokens, likely using a naive word boundary regex or something similar.
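To show how a regex-based estimate can drift from the real count, here's a rough sketch of two naive counters. Both functions are hypothetical illustrations; neither is what the OP site actually does (that's unknown), and neither reproduces the real tokenizer.

```python
import re

def count_words(text: str) -> int:
    # Naive estimate: every whitespace-separated chunk counts as one "token".
    return len(text.split())

def count_words_and_punct(text: str) -> int:
    # Slightly less naive: each alphanumeric run and each punctuation
    # character counts separately. Still not a real tokenizer.
    return len(re.findall(r"[A-Za-z0-9]+|[^A-Za-z0-9\s]", text))

original = "The GPT family of models process text using tokens"
compressed = "GPT models: process_text(tokens)"

for label, text in (("original", original), ("compressed", compressed)):
    print(label, count_words(text), count_words_and_punct(text))
```

The whitespace counter makes the "compressed" string look three times cheaper, while just counting punctuation separately already erases the saving. The real tokenizer is harsher still.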
This is because things like punctuation marks are their own tokens, and complex words or abbreviations are broken down into one token per piece in the dictionary. The string "ABCDEFGHIJKLMNOP" (16 characters) is 8 tokens (consisting of the bigrams AB, CD, EF, etc.), while the string "Counterintuitive" (also 16 characters) is a whopping 2 tokens (likely the tokens for "counter" and "intuitive").
Fewer characters doesn't equal fewer tokens, and in fact, the more esoteric the string, the more likely it is that it consumes an unintuitively large number of tokens.
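The splitting described above can be imitated with a toy tokenizer. This is only a sketch under loud assumptions: the vocabulary is made up to mirror the splits guessed at above, and the greedy longest-match lookup is a stand-in for illustration. Real GPT tokenizers use byte-pair encoding with learned merges, not this.

```python
# Hypothetical vocabulary, invented for this example only.
VOCAB = {"Counter", "intuitive", "AB", "CD", "EF", "GH", "IJ", "KL", "MN", "OP"}

def tokenize(text: str, vocab: set[str]) -> list[str]:
    """Greedy longest-match split; unknown single characters become their own token."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

for s in ("ABCDEFGHIJKLMNOP", "Counterintuitive"):
    pieces = tokenize(s, VOCAB)
    print(len(s), "chars ->", len(pieces), "tokens:", pieces)
```

Both inputs are 16 characters, but the common word is covered by two long dictionary entries while the arbitrary letter run only matches short ones.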