by sillysaurusx
1170 days ago
There's the dataset used during training, and there's the dataset used to build the tokenizer. The confusion here is that people are talking about the former, but you're correct that it's the latter. Remember, OpenAI's tokenizer was created in an era when 125MB was considered large for a language model. It's hard to fault them for making something that lasted four or five years.
|
GPT-2 and GPT-3 have different vocabularies and maximum token #s, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.
But even if they are using the tokenizer from GPT-3, it's not from “an era when 125MB was considered large for a language model”.