Hacker News
by ddod 1170 days ago
The original tweet author eventually realized this didn't decrease tokens. In most cases it actually increases them compared to just asking GPT to summarize while retaining all functional data. If a word == 1 token, a related emoji will also == 1 token.
4 comments

The gnarly thing is that common dictionary words are usually 1 or 2 tokens (a whole word, or a stem + a suffix), while things like emojis end up as _3_ tokens.

https://platform.openai.com/tokenizer

The string "The GPT family of models process text using tokens" is 10 tokens.

Feeding that string to the "compressor" results in "GPT models: process_text(tokens)", which is...12 tokens. The OP site incorrectly estimates that this is 8 tokens, likely using a naive word boundary regex or something similar.
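
For illustration only, here is one guess at what a "naive word boundary regex" estimator could look like. The function name and the regex are my own assumptions, not the OP site's actual code. It splits on `\b` and counts the non-empty pieces, which happens to give 8 for the compressed string above.

```python
import re

def naive_estimate(text: str) -> int:
    """Hypothetical estimator: split on regex word boundaries, count non-empty pieces.

    This is NOT a real tokenizer; it only mimics the kind of shortcut
    the comment above speculates about.
    """
    pieces = [p for p in re.split(r"\b", text) if p]
    return len(pieces)

compressed = "GPT models: process_text(tokens)"
print(re.split(r"\b", compressed))
print(naive_estimate(compressed))  # 8
```

A count like this has no knowledge of the model's vocabulary, so it can't be trusted in either direction. The only reliable number comes from running the actual tokenizer.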

This is because punctuation marks are each their own token, and complex words or abbreviations are broken down into one token per piece in the dictionary. The string "ABCDEFGHIJKLMNOP" (16 characters) is 8 tokens (consisting of the bigrams AB, CD, EF, etc.), while the string "Counterintuitive" (also 16 characters) is a whopping 2 tokens (likely the tokens for "counter" and "intuitive").

Fewer characters don't mean fewer tokens, and in fact, the more esoteric the string, the more likely it is to consume an unintuitively large number of tokens.
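
A toy sketch of the idea, assuming a tiny made-up vocabulary and a greedy longest-match split. The real GPT tokenizer uses byte-pair encoding with a learned merge table, so this is not its algorithm. It only shows how two 16-character strings can land on very different token counts depending on what the dictionary contains. `TOY_VOCAB` and `toy_tokenize` are hypothetical names.

```python
# Toy vocabulary, invented for illustration only (NOT the real GPT vocabulary).
TOY_VOCAB = {"Counter", "intuitive", "AB", "CD", "EF", "GH", "IJ", "KL", "MN", "OP"}

def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match split; unknown characters become single-char tokens."""
    max_len = max(len(piece) for piece in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

for s in ("ABCDEFGHIJKLMNOP", "Counterintuitive"):
    toks = toy_tokenize(s)
    print(len(s), "chars ->", len(toks), "tokens:", toks)
```

With this vocabulary the first string splits into 8 pieces and the second into 2, even though both are 16 characters long.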

> Unicode characters like emojis may be split into many tokens containing the underlying bytes: ������ [<- this is a single emoji]

Source: https://platform.openai.com/tokenizer
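
To make the "underlying bytes" part concrete, here is a small sketch that prints how many UTF-8 bytes a few characters take. The helper name `utf8_len` is made up. A byte-level tokenizer can fall back to one token per byte when a character has no dedicated vocabulary entry, so the byte count is roughly a worst case. The actual token count depends on the tokenizer's merges and can be lower.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))

samples = {
    "ASCII letter": "a",
    "accented letter": "\u00e9",   # é
    "euro sign": "\u20ac",         # €
    "emoji": "\U0001F642",         # slightly smiling face
}

for label, ch in samples.items():
    print(f"{label}: {utf8_len(ch)} byte(s) -> {ch.encode('utf-8').hex(' ')}")
```

The ASCII letter is 1 byte, while the emoji is 4, which is why a single emoji can cost several tokens where a common word costs one.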

Well, that makes sense! Otherwise, it does reduce tokens if the result doesn't contain emojis.
Any good prompt suggestions for this task?

"summarize the following while retaining all functional data:"

doesn't seem to do the trick.