by axiom92 1500 days ago
50k is not the number of unique words that GPT-3 supports; perhaps you're referring to the number of BPE tokens in its vocabulary. The input to GPT-3 is not tokenized by splitting on spaces. It is split into byte-pair encoding (BPE) tokens, which are often pieces of words rather than whole words. You can play with it here: https://beta.openai.com/tokenizer.

A rare word like "blithe" is tokenized into two BPE tokens, "bl" and "ithe", whereas common words like "the" get their own token.
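
To make the mechanics concrete, here is a rough sketch of how BPE merging produces that split. The merge table below is invented purely for illustration (the real GPT-3 merges are learned from data and operate on bytes, with special handling of leading spaces), and bpe_tokenize is a hypothetical helper, not OpenAI's code:

```python
# Toy merge table, invented for illustration only. Earlier entries have
# higher priority. The real vocabulary has roughly 50k learned entries.
TOY_MERGES = [("t", "h"), ("th", "e"), ("b", "l"), ("i", "the")]
RANKS = {pair: rank for rank, pair in enumerate(TOY_MERGES)}


def bpe_tokenize(word):
    """Start from single characters and repeatedly apply the best-ranked merge."""
    symbols = list(word)
    while len(symbols) > 1:
        # Collect all adjacent pairs and pick the one with the lowest rank.
        pairs = [(symbols[i], symbols[i + 1]) for i in range(len(symbols) - 1)]
        best = min(pairs, key=lambda p: RANKS.get(p, float("inf")))
        if best not in RANKS:
            break  # no known merge applies any more
        # Merge every occurrence of the chosen pair, left to right.
        merged = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


print(bpe_tokenize("the"))     # ['the']
print(bpe_tokenize("blithe"))  # ['bl', 'ithe']
```

With this toy table, "the" collapses into one token because every merge it needs is present, while "blithe" stops at two pieces because there is no merge for ("bl", "ithe"). A word with no applicable merges at all just stays as individual characters.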