by terafo
1161 days ago
Nope, it's not about the dataset. It's just a bad tokenizer. Korean has a couple dozen symbols in its alphabet. Cyrillic languages have fewer than 50 symbols in total. Hiragana is 46 symbols. GPT-4 has 32k tokens IIRC. Including the most significant alphabets would take less than a thousand.
GPT-4 has a vocabulary much larger than 32k tokens (GPT-3 seems to have had up to 175k, and GPT-2 was in the neighborhood of 50k, based on the max value reported for their tokenizers). What it has is a 32k-token context window (that is, the maximum size of prompt + response), not a 32k-token vocab.
But tokens are generally semantically significant parts of words (often whole words), not just letters or the equivalent. So, while you might cover most alphabets in less than a thousand tokens, you need a lot more than an alphabet to handle a language.
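
To make the letters-vs-word-pieces point concrete, here is a rough sketch in Python. It is not GPT's actual tokenizer (those use byte-pair encoding with learned merges); the greedy longest-match function and both toy vocabularies below are made up purely for illustration.

```python
def greedy_tokenize(text, vocab):
    """Split text by repeatedly taking the longest vocab entry that matches.

    Toy stand-in for a real subword tokenizer; the vocab is hypothetical.
    """
    max_len = max(len(entry) for entry in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Character not in the vocab at all: emit it as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


text = "the tokenizer tokenized the tokens"

# Vocabulary 1: just an alphabet (plus space).
alphabet_vocab = set("abcdefghijklmnopqrstuvwxyz ")

# Vocabulary 2: the same alphabet plus a handful of invented word pieces.
subword_vocab = alphabet_vocab | {"the", " the", " token", "izer", "ized"}

char_tokens = greedy_tokenize(text, alphabet_vocab)
subword_tokens = greedy_tokenize(text, subword_vocab)

print(len(char_tokens))     # 34: one token per character
print(len(subword_tokens))  # 8
print(subword_tokens)
```

With only an alphabet, every character costs one token, so the same 32k context window holds far less text. Adding even a few word pieces cuts this sentence from 34 tokens to 8, which is why real vocabularies spend most of their entries on sub-word units rather than on letters.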