There's the dataset used during training, and the dataset used to build the tokenizer. The confusion here is that people are talking about the former, but you're correct that it's the latter.
Remember, OpenAI's tokenizer was created in an era when 125MB was considered large for a language model. It's hard to fault them for making something that lasted four or five years.
> Remember, OpenAI’s tokenizer was created in an era when 125MB was considered large for a language model.
GPT-2 and GPT-3 have different vocabularies and maximum token counts, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.
But even if they are using the tokenizer from GPT-3, it's not from “an era when 125MB was considered large for a language model”.
(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
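For anyone who wants to check the rounding arithmetic, here's a minimal sketch. The helper name `pad_vocab_size` is made up for illustration and isn't from any particular codebase; the only inputs taken from the comment above are the 50257 vocab size and the multiple of 128.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`.

    Hypothetical helper for illustration only.
    """
    return ((vocab_size + multiple - 1) // multiple) * multiple


GPT2_VOCAB_SIZE = 50257  # vocab size quoted in the thread

padded = pad_vocab_size(GPT2_VOCAB_SIZE)
unused_rows = padded - GPT2_VOCAB_SIZE  # embedding rows no token ID maps to

print(padded, unused_rows)  # 50304 47
```

So the embedding table would have 50304 rows under this assumption, with the last 47 never indexed by a real token ID.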
Believe it or not, 125M parameters was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.