by Der_Einzige 1500 days ago
Part of the problem here is that GPT-3 has such a small vocabulary. It's about 50K tokens, and many of those are either garbage, punctuation, or full words (rather than subwords).

I'd be curious to see what scaling up the size of the vocabulary would do to improve these results in a model like GPT-3...

2 comments

50k is not the number of unique words that GPT-3 supports; perhaps you're referring to the BPE tokens. The input to GPT-3 is not tokenized by splitting on spaces. It is split into byte-pair encoding (BPE) tokens instead. You can play with it here: https://beta.openai.com/tokenizer.

A rare word like "blithe" is tokenized into two BPE tokens, "bl" and "ithe", whereas common words like "the" get their own token.

I don't think a larger vocab would help. All the individual letters are in the ~50k token vocab already, but the word "alphabet" will still not get tokenized to [a, l, p, h, a, b, e, t]. Using a larger vocab like PaLM's 256k vocab would have the same issue.
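
To make that concrete, here's a toy sketch of greedy BPE merging in Python. The merge table (TOY_MERGES) and the function name bpe_encode are made up for illustration and are not GPT-3's real merge list; the real tokenizer also works on bytes and handles leading spaces, which this skips. It only shows why single letters being in the vocab doesn't mean a word gets split into them:

```python
# Toy BPE sketch. TOY_MERGES is a hypothetical, hand-picked merge table,
# NOT the real GPT-3 merges. Earlier entries have higher priority.
TOY_MERGES = [
    ("t", "h"),
    ("th", "e"),
    ("b", "l"),
    ("i", "the"),
    ("a", "l"),
    ("p", "h"),
    ("e", "t"),
]


def bpe_encode(word, merges=TOY_MERGES):
    """Split a word into characters, then repeatedly apply the
    highest-priority merge that matches an adjacent pair."""
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    tokens = list(word)
    while len(tokens) > 1:
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: ranks.get(p, float("inf")))
        if best not in ranks:
            break  # no known merge applies to any adjacent pair
        merged = []
        i = 0
        while i < len(tokens):
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens


print(bpe_encode("the"))       # ['the']
print(bpe_encode("blithe"))    # ['bl', 'ithe']
print(bpe_encode("alphabet"))  # ['al', 'ph', 'a', 'b', 'et']
```

Every letter of "alphabet" is a valid token in this toy vocab, but the merges still fire, so the model never sees [a, l, p, h, a, b, e, t]. Adding more merges (a bigger vocab) only makes the pieces longer, not more letter-like.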