by terafo 1161 days ago
Nope, it's not about the dataset. It's just a bad tokenizer. Korean has couple of dozen of symbols in its alphabet. Cyrillic languages have fewer than 50 symbols in total. Hiragana is 46 symbols. GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.
5 comments

> GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.

GPT-4 has a vocabulary much larger than 32k tokens (GPT-3 seems to have had up to 175k, GPT-2 in the neighborhood of 50k, based on the max value reported for their tokenizers). What it has is a 32k token context window (that is, the maximum size of prompt + response), not a 32k vocab.

But tokens are generally semantically significant parts of words (often whole words), not just letters or the equivalent. So, while you might get most alphabets in less than a thousand, you need a lot more than an alphabet to handle a language.

I confused LLaMA's vocabulary size, which is indeed 32k, with GPT-4's vocab size. Still, my point stands. You can add those characters there at minuscule cost.

> Korean has couple of dozen of symbols in its alphabet.

While that is true (14 consonants, 10 vowels [0]), there are encodings for Korean that encode at the syllable level (where each syllable contains one or two consonants and one vowel) and the combinations for syllables are over 10000 (e.g. 11172 code points listed in Unicode, see [1]).

[0] In practice there are more, both to cater for modern and obsolete forms and to distinguish the forms by position, i.e. with separate encodings for leading vs. trailing consonants, etc.

[1] https://en.wikipedia.org/wiki/Hangul_Syllables

In a bizarre coincidence I've just been working on code handling Korean cluster breaks, and while it's true there are a lot of codepoints, the rules for handling them are mathematically trivial when considered as codepoint values.
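
To make the "mathematically trivial" point concrete, here is a minimal sketch of the arithmetic for precomposed Hangul syllables (it does not cover full cluster-break handling). The counts 19, 21 and 28 are the modern leading consonants, vowels, and trailing consonants (including "no trailing consonant") from the Unicode Hangul Syllables block; the function names are made up for illustration.

```python
# Constants for the Unicode Hangul Syllables block (U+AC00..U+D7A3).
S_BASE = 0xAC00
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # leading, vowel, trailing (0 = none)
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # 19 * 21 * 28 = 11172 syllables


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (leading, vowel, trailing) indices for one precomposed syllable."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    l_index, rest = divmod(s_index, V_COUNT * T_COUNT)
    v_index, t_index = divmod(rest, T_COUNT)
    return l_index, v_index, t_index


def compose(l_index: int, v_index: int, t_index: int = 0) -> str:
    """Inverse of decompose: build the syllable from its three indices."""
    return chr(S_BASE + (l_index * V_COUNT + v_index) * T_COUNT + t_index)


print(S_COUNT)          # 11172
print(decompose("독"))  # (3, 8, 1), i.e. ㄷ + ㅗ + trailing ㄱ
```

So the 11172 code points are just every (leading, vowel, trailing) combination laid out in order, and going between a syllable and its parts is two divisions.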

(But I guess I also won't be surprised if the OpenAI guys can't write algorithms worth spit if it's not a large matrix multiplication.)

Including those alphabets as letters or single glyphs would still leave ドイツ taking 3 tokens, whereas "Germany" is one token ("germany" is two tokens: [ger][many]).

And tossing ドイツ into the tokenizer shows that it is 3 tokens.

Consider also the question "is it useful to just tokenize hiragana or katakana and not all of the kanji characters?"

The glyph-by-glyph approach to tokenizing non-English text that you are describing is already present, and because it is glyph by glyph, the text gets expanded out and consumes more tokens.

Korean gets rather interesting because 독일 is not one character but several: multiple sounds are combined into one glyph, and each glyph is one syllable. That word is 'dog-il' according to Google Translate. In the first glyph, ㄷ is 'd', ㅗ is 'o' and ㄱ is a trailing 'g'. In the second glyph, ㅇ is a silent placeholder, ㅣ is 'i' and ㄹ is a trailing 'l'.

Likewise, its GPT tokenization is 5 tokens.
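
One way to see where counts like these come from, as a rough sketch: assuming a byte-level BPE tokenizer (which is how GPT-2's tokenizer works), any text that has no learned merges falls back to one token per UTF-8 byte, so the byte count is an upper bound on the token count. The helper name below is made up for illustration, and this does not reproduce any real tokenizer's output.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


for word in ["Germany", "ドイツ", "독일"]:
    chars, nbytes = utf8_profile(word)
    print(f"{word}: {chars} code points, {nbytes} UTF-8 bytes")
```

"Germany" is 7 bytes that merge down to one token because the tokenizer saw it often; 독일 is 6 bytes, and 5 tokens means almost nothing was merged.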

Using plain characters would make the sentences longer and cost much more money to use.

That's the idea of byte pair encoding based tokenizers: reduce the average sentence's number of tokens to an optimal (short) size to reduce the computational cost. In this case, most of the training data is in English, so sentences are going to be shorter (in number of tokens) in English than in other languages.
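
A toy sketch of that idea, under loud assumptions: this is a character-level BPE trainer (real GPT tokenizers work on bytes and add pre-tokenization rules), the corpus is invented, and `train_bpe`, `apply_merge` and `encode` are illustrative names, not any library's API. It only shows that merges follow frequency in the tokenizer's training data.

```python
from collections import Counter


def apply_merge(symbols: tuple[str, ...], pair: tuple[str, str]) -> tuple[str, ...]:
    """Replace every adjacent occurrence of `pair` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges, most frequent adjacent pair first."""
    words = Counter(tuple(word) for word in corpus)  # start from single characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged_words = Counter()
        for symbols, freq in words.items():
            merged_words[apply_merge(symbols, best)] += freq
        words = merged_words
    return merges


def encode(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize one word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Invented corpus: English-heavy, with one Japanese word.
corpus = ["germany"] * 10 + ["ドイツ"]
merges = train_bpe(corpus, num_merges=6)
print(encode("germany", merges))  # ['germany']
print(encode("ドイツ", merges))    # ['ド', 'イ', 'ツ']
```

With only six merges to spend, all of them go to the frequent English word, and the rare Japanese word stays at one token per character.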

But the tokenizer is dataset-driven... it tokenizes the most common patterns in your dataset to improve efficiency, so it's 100% about the dataset?

There's the dataset used for training the model, and the dataset used for building the tokenizer. The confusion here is that people are talking about the former, but you're correct that it's the latter.

Remember, OpenAI's tokenizer was created in an era when 125MB was considered large for a language model. It's hard to fault them for making something that lasted four or five years.

> Remember, OpenAI’s tokenizer was created in an era when 125MB was considered large for a language model.

GPT-2 and GPT-3 have different vocabularies and maximum token #s, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.

But even if they are using the tokenizer from GPT-3, it's not from “an era when 125MB was considered large for a language model”.

Actually, GPT-3's tokenizer is the same as GPT-2. https://datascience.stackexchange.com/a/109483

You had me questioning myself for a minute.

(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
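
For reference, the rounding mentioned here is a one-liner. This is just the arithmetic, with a made-up helper name, not code from any particular training framework:

```python
def pad_vocab(vocab_size: int, multiple: int = 128) -> int:
    """Round vocab_size up to the nearest multiple of `multiple`."""
    return -(-vocab_size // multiple) * multiple  # ceiling division, then scale back


padded = pad_vocab(50257)
print(padded, padded - 50257)  # 50304 47
```

So the embedding table would get 47 extra rows that no token ever maps to.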

Believe it or not, 125M was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.