| HN Mirror

by dragonwriter 1166 days ago

> GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.

GPT-4 has much more than a 32k token vocabulary (GPT-3 seems to have had up to 175k, GPT-2 in the neighborhood of 50k, based on the max value reported for their tokenizers). What it has is a 32k token context window (that is, the maximum size of prompt + response), not a 32k vocab.

But tokens are generally semantically significant parts of words (often whole words), not just letters or the equivalent. So, while you might get most alphabets in less than a thousand, you need a lot more than an alphabet to handle a language.
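To make the vocab vs. context window distinction concrete, here is a rough toy sketch. The vocabulary, the greedy longest-match rule, and the `CONTEXT_WINDOW` value are all made up for illustration; real tokenizers (BPE and friends) are trained on data and work differently in detail.

```python
# Toy greedy longest-match tokenizer. VOCAB and CONTEXT_WINDOW are
# hypothetical values chosen for illustration, not any real model's.
VOCAB = ["token", "izer", "s", " are", " fun",
         "t", "o", "k", "e", "n", "i", "z", "r", " ", "a", "f", "u"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}
MAX_TOKEN_LEN = max(len(tok) for tok in VOCAB)

# Vocabulary size: how many distinct tokens exist.
VOCAB_SIZE = len(VOCAB)
# Context window: how many tokens fit in one sequence (prompt + response).
CONTEXT_WINDOW = 8


def tokenize(text):
    """Split text into token ids, always taking the longest matching piece."""
    ids = []
    pos = 0
    while pos < len(text):
        longest = min(MAX_TOKEN_LEN, len(text) - pos)
        for length in range(longest, 0, -1):
            piece = text[pos:pos + length]
            if piece in TOKEN_TO_ID:
                ids.append(TOKEN_TO_ID[piece])
                pos += length
                break
        else:
            raise ValueError(f"no token covers {text[pos]!r}")
    return ids


text = "tokenizers are fun"
ids = tokenize(text)
pieces = [VOCAB[i] for i in ids]

print("vocab size:", VOCAB_SIZE)
print("pieces:", pieces)
print("subword tokens used:", len(ids), "of", CONTEXT_WINDOW)
print("tokens if letters only:", len(text), "of", CONTEXT_WINDOW)
```

With the multi-letter entries in the vocabulary the sentence fits in the toy context window; with an alphabet-only vocabulary the same text would need one token per character and would not fit.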

terafo 1166 days ago

I confused LLaMA's vocabulary size, which is indeed 32k, with GPT-4's vocab size. Still, my point stands. You can add those characters there with minuscule cost.
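A back-of-the-envelope sketch of that cost, counting parameters only. The assumptions here are mine: LLaMA-7B-like sizes (hidden size 4096, roughly 6.7B parameters total), separate (untied) input and output embedding matrices, and 1,000 added tokens taken from the "less than a thousand" estimate above. It says nothing about the training needed to make the new tokens useful.

```python
# Rough parameter cost of adding tokens to a vocabulary.
# All figures are assumptions for illustration (LLaMA-7B-like sizes).
HIDDEN_SIZE = 4_096             # embedding width per token
TOTAL_PARAMS = 6_700_000_000    # approximate total model parameters
NEW_TOKENS = 1_000              # "less than a thousand" extra characters


def extra_params(new_tokens, hidden_size, tied=False):
    """Each new token adds one row to the input embedding matrix and,
    if the output projection is not tied to it, one row there as well."""
    rows_per_token = 1 if tied else 2
    return new_tokens * rows_per_token * hidden_size


added = extra_params(NEW_TOKENS, HIDDEN_SIZE)
fraction = added / TOTAL_PARAMS

print(f"added parameters: {added:,}")
print(f"share of model: {fraction:.4%}")
```

Under these assumptions the extra rows come to about 8.2M parameters, a small fraction of a percent of the model.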
