| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by xg15 931 days ago

According to the openAI tokenizer[3], GPT-4 sees the following tokens in the above text:

Seems to me, this task depends heavily on the tokenizer, and I'm a bit sceptical if that is really the tokenizer's output. Isn't BPE supposed to result in the longest letter sequences that are in the dictionary?

If you assume that common words like "underneath" and "the" are in the dictionary, the "greedy" tokenization would match the actual words.