by andrewmutz 1500 days ago
I believe GPT-3 uses byte pair encoding, which allows it to do tokenization in a language-neutral manner:

https://en.wikipedia.org/wiki/Byte_pair_encoding
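
For anyone curious what the merge procedure looks like, here is a toy sketch of byte-level BPE training. It is only an illustration of the general technique, not GPT-3's actual tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `train_bpe`) are made up for this example, and starting from raw UTF-8 bytes is an assumption that mirrors why the approach doesn't depend on any particular language.

```python
from collections import Counter


def most_frequent_pair(ids):
    """Return the most common adjacent pair of token ids, or None if there is none."""
    pairs = Counter(zip(ids, ids[1:]))
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(ids, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` (left to right) with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i + 1 < len(ids) and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out


def train_bpe(text, num_merges):
    """Learn up to `num_merges` merges over the UTF-8 bytes of `text`.

    Hypothetical helper for illustration: ids 0..255 are raw bytes,
    and each learned merge gets the next free id starting at 256.
    """
    ids = list(text.encode("utf-8"))
    merges = {}
    for step in range(num_merges):
        pair = most_frequent_pair(ids)
        if pair is None:
            break  # fewer than two tokens left, nothing to merge
        new_id = 256 + step
        merges[pair] = new_id
        ids = merge_pair(ids, pair, new_id)
    return ids, merges
```

Calling `train_bpe("aaabdaaabac", 3)` repeatedly fuses the most frequent adjacent pair, so the 11 input bytes shrink to 5 tokens. Because the starting alphabet is bytes rather than words or characters, the same code handles any UTF-8 text without language-specific rules.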

3 comments

Yeah, it's BPE. OpenAI has a nice tool that lets you play with the tokenizer: https://beta.openai.com/tokenizer
Additionally, the tokenizer vocabulary is unchanged from GPT-2.

You can use HuggingFace's GPT-2 tokenizer as well (some of OpenAI's GPT-3 notebooks do just that).

I thought I read it uses word2vec?