by moelf 1161 days ago
but the tokenizer is dataset-driven... it tokenizes the most common patterns in your dataset to improve efficiency, so it's 100% about the dataset?

There's the dataset used during training, and the dataset used to build the tokenizer. The confusion here is that people are talking about the former, but you're correct that it's the latter.
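To make the "dataset for the tokenizer" point concrete, here is a toy sketch of byte-pair-encoding style merge learning. It is not OpenAI's implementation; the function names (`most_common_pair`, `merge_pair`, `learn_merges`) and the whitespace-split corpus are assumptions for illustration only. It shows that the merges learned depend entirely on the text the tokenizer is fitted on.

```python
from collections import Counter


def most_common_pair(words):
    """Return the most frequent adjacent symbol pair in the corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_merges(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-split toy corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        # Stop once every word is a single symbol and no pairs remain.
        if all(len(symbols) < 2 for symbols in words):
            break
        pair = most_common_pair(words)
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges


# Two different tokenizer corpora yield two different first merges.
merges_a = learn_merges("ab ab ab cd", 1)
merges_b = learn_merges("cd cd cd ab", 1)
```

The model's training data never appears here: only the corpus passed to `learn_merges` decides which patterns become tokens.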

Remember, OpenAI's tokenizer was created in an era when 125MB was considered large for a language model. It's hard to fault them for making something that lasted four or five years.

> Remember, OpenAI’s tokenizer was created in an era when 125MB was considered large for a language model.

GPT-2 and GPT-3 have different vocabularies and maximum token #s, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.

But even if they are using the tokenizer from GPT-3, it's not from “an era when 125MB was considered large for a language model”.

Actually, GPT-3's tokenizer is the same as GPT-2's. https://datascience.stackexchange.com/a/109483

You had me questioning myself for a minute.

(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
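As a quick check of the rounding arithmetic, here is a minimal sketch. The helper name `pad_vocab_size` is hypothetical; the vocab size of 50257 and the multiple of 128 come from the comment above.

```python
def pad_vocab_size(vocab_size, multiple=128):
    """Round vocab_size up to the nearest multiple (hypothetical helper)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


GPT2_VOCAB_SIZE = 50257  # vocab size quoted in the comment above

padded = pad_vocab_size(GPT2_VOCAB_SIZE)
unused_rows = padded - GPT2_VOCAB_SIZE  # embedding rows no token ever maps to

print(padded, unused_rows)
```

Under these assumptions the embedding table gets 50304 rows, and the last 47 are padding that no token ID refers to.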

Believe it or not, 125M was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.