by dragonwriter 1169 days ago

> Remember, OpenAI’s tokenizer was created in an era when 125MB was considered large for a language model.

GPT-2 and GPT-3 have different vocabularies and maximum token counts, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.

But even if they are using the tokenizer from GPT-3, it’s not from “an era when 125MB was considered large for a language model”.

1 comment

sillysaurusx 1169 days ago

Actually, GPT-3's tokenizer is the same as GPT-2's. https://datascience.stackexchange.com/a/109483

You had me questioning myself for a minute.

(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
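To make the rounding concrete, here is a minimal sketch of the arithmetic. The helper name `pad_vocab_size` is made up for illustration and is not taken from any particular codebase; the only inputs from above are the 50257 vocab size and the multiple of 128.

```python
def pad_vocab_size(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple (hypothetical helper)."""
    return ((vocab_size + multiple - 1) // multiple) * multiple


VOCAB_SIZE = 50257  # real tokens in the GPT-2 vocabulary

padded = pad_vocab_size(VOCAB_SIZE)  # rows in the padded embedding table
unused = padded - VOCAB_SIZE         # padding rows that never map to a token

print(padded, unused)  # 50304 47
```

So the embedding table would have 50304 rows under this padding scheme, and the last 47 never correspond to a token the tokenizer can emit.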

Believe it or not, 125M was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.
