by galaxytachyon 1161 days ago
So what I got from this is that GPT was trained on a dataset that is biased toward English content. Is that right?

I think even humans have to spend extra energy to speak a language they were not born with, no matter how fluent they are in it. I don't know about natural multilinguals.

Nope, it's not about the dataset. It's just a bad tokenizer. Korean has couple of dozen of symbols in its alphabet. Cyrillic languages have fewer than 50 symbols in total. Hiragana is 46 symbols. GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.
> GPT-4 has 32k tokens IIRC. Including most significant alphabets would take less than a thousand.

GPT-4 has a vocabulary much larger than 32k tokens (GPT-3 seems to have had up to 175k, GPT-2 in the neighborhood of 50k, based on the max value reported for their tokenizers). What it has is a 32k-token context window (that is, the maximum size of prompt + response), not a 32k vocab.

But, tokens are generally semantically-significant parts of words (often whole words), not just letters or the equivalent. So, while you might get most alphabets in less than a thousand, you need a lot more than alphabet to handle a language.

I confused the LLaMA vocabulary size, which is indeed 32k, with the GPT-4 vocab size. Still, my point stands. You can add those characters there at minuscule cost.
> Korean has couple of dozen of symbols in its alphabet.

While that is true (14 consonants, 10 vowels [0]), there are encodings for Korean that encode at the syllable level (where each syllable contains one or two consonants and one vowel) and the combinations for syllables are over 10000 (e.g. 11172 code points listed in Unicode, see [1]).

[0] In practice, more, both to cater for modern and obsolete forms and to distinguish the forms based on their position (i.e. with separate encodings for leading vs. trailing consonants, etc.).

[1] https://en.wikipedia.org/wiki/Hangul_Syllables

In a bizarre coincidence I've just been working on code handling Korean cluster breaks and while it's true there's a lot of codepoints, the rules for handling them are mathematically trivial when considered as codepoint values.
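
As one illustration of that kind of arithmetic (not the cluster-break code mentioned above), here is a minimal sketch of splitting a precomposed Hangul syllable into its leading consonant, vowel, and trailing consonant. The constants are the ones the Unicode standard gives for the Hangul Syllables block; the function names `decompose`, `compose`, and `to_jamo` are made up for this example.

```python
import unicodedata

# Constants for the Hangul Syllables block, as given in the Unicode standard.
S_BASE = 0xAC00                          # first precomposed syllable
L_BASE, V_BASE, T_BASE = 0x1100, 0x1161, 0x11A7
L_COUNT, V_COUNT, T_COUNT = 19, 21, 28   # T_COUNT includes "no trailing consonant"
S_COUNT = L_COUNT * V_COUNT * T_COUNT    # 11172 precomposed syllables


def decompose(syllable: str) -> tuple[int, int, int]:
    """Return (leading, vowel, trailing) indices for one precomposed syllable."""
    s_index = ord(syllable) - S_BASE
    if not 0 <= s_index < S_COUNT:
        raise ValueError(f"{syllable!r} is not a precomposed Hangul syllable")
    leading = s_index // (V_COUNT * T_COUNT)
    vowel = (s_index % (V_COUNT * T_COUNT)) // T_COUNT
    trailing = s_index % T_COUNT          # 0 means no trailing consonant
    return leading, vowel, trailing


def compose(leading: int, vowel: int, trailing: int = 0) -> str:
    """Inverse of decompose: build the precomposed syllable from indices."""
    return chr(S_BASE + (leading * V_COUNT + vowel) * T_COUNT + trailing)


def to_jamo(syllable: str) -> str:
    """Spell one syllable as conjoining jamo code points."""
    leading, vowel, trailing = decompose(syllable)
    jamo = chr(L_BASE + leading) + chr(V_BASE + vowel)
    if trailing:
        jamo += chr(T_BASE + trailing)
    return jamo


# Cross-check against Python's own canonical decomposition (NFD).
print(to_jamo("한") == unicodedata.normalize("NFD", "한"))
```

The whole thing is integer division and remainder on the code point value, which is why 11172 syllables do not need 11172 special cases.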

(But I guess I also won't be surprised if the OpenAI guys can't write algorithms worth spit if it's not a large matrix multiplication.)

Including those alphabets as letters or single glyphs would still leave it so that ドイツ would still take 3 tokens whereas "Germany" is one token ("germany" is two tokens: [ger][many]).

And tossing ドイツ into the tokenizer shows that it is 3 tokens.

Consider also the question "is it useful to just tokenize hiragana or katakana and not all of the kanji characters?"

The glyph-by-glyph approach to tokenizing non-English text is already present in the way you are describing it, and because it is glyph by glyph, it gets expanded out and consumes more tokens.

Korean gets rather interesting because 독일 is not one character but several - multiple sounds are combined into one glyph and each glyph is one syllable. That word is 'dog-il' according to google translate. On the first glyph, ㄷ is 'd' and ㅗ is 'o' and ㄱ is a trailing 'g'. On the second glyph ㅣ is 'i' and ㄹ is a trailing 'l'.

Likewise, its GPT tokenization is 5 tokens.
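
For a rough sense of why these counts come out the way they do: GPT-2-style tokenizers are byte-level BPE, so a string with no learned merges can cost up to one token per UTF-8 byte. The sketch below only counts code points and bytes using the standard library; it does not call any real tokenizer, and `utf8_profile` is a name made up for this example. Actual token counts depend on which merges the tokenizer learned.

```python
def utf8_profile(text: str) -> tuple[int, int]:
    """Return (number of code points, number of UTF-8 bytes) for text."""
    return len(text), len(text.encode("utf-8"))


for word in ("Germany", "ドイツ", "독일"):
    code_points, n_bytes = utf8_profile(word)
    print(f"{word}: {code_points} code points, {n_bytes} UTF-8 bytes")
```

The byte count is an upper bound on the token count for a byte-level tokenizer, which is consistent with the 3 and 5 tokens reported above.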

using plain characters would make the sentences longer & cost much more money to use.

that's the idea of byte pair encoding based tokenizers, reduce the average sentence's number of tokens to an optimal (short) size to reduce the computational cost. in this case, most of its training data is in english so it's going to have shorter sentences (nb of tokens) in english vs other languages
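
A toy sketch of that idea, assuming a tiny made-up corpus where an English word appears 50 times and a Korean word once. This is plain byte-level BPE with no pre-tokenization, so it is much simpler than what OpenAI actually ships; all names here (`train_bpe`, `encode`, etc.) are hypothetical.

```python
from collections import Counter

Symbols = tuple[bytes, ...]


def to_symbols(word: str) -> Symbols:
    """Split a word into single-byte symbols (its UTF-8 encoding)."""
    return tuple(bytes([b]) for b in word.encode("utf-8"))


def apply_merge(symbols: Symbols, pair: tuple[bytes, bytes]) -> Symbols:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(words: list[str], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to num_merges merge rules, most frequent pair first."""
    corpus = Counter(to_symbols(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[apply_merge(symbols, best)] += freq
        corpus = merged
    return merges


def encode(word: str, merges: list[tuple[bytes, bytes]]) -> list[bytes]:
    """Tokenize a word by replaying the learned merges in order."""
    symbols = to_symbols(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


corpus = ["germany"] * 50 + ["독일"]       # English-heavy toy corpus
merges = train_bpe(corpus, num_merges=6)
print(len(encode("germany", merges)))      # 1
print(len(encode("독일", merges)))          # 6
```

With a budget of six merges, every merge goes to the frequent English word, so it collapses to one token while the Korean word stays at one token per byte.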

but the tokenizer is dataset-driven... it tokenizes the most common patterns in your dataset to improve efficiency, so it's 100% about the dataset?
There's the dataset used during training, and the dataset used for the tokenizer. The confusion here is that people are talking about the former, but you're correct that it's the latter.

Remember, OpenAI's tokenizer was created in an era when 125MB was considered large for a language model. It's hard to fault them for making something that lasted four or five years.

> Remember, OpenAI’s tokenizer was created in an era when 125MB was considered large for a language model.

GPT-2 and GPT-3 have different vocabularies and maximum token #s, which (even if the tokenizer architecture is the same) implies a different tokenizer model. GPT-3.5 might share the GPT-3 tokenizer, but even then I’d expect GPT-4 to have its own.

But even if they are using the tokenizer from GPT-3, it's not from “an era when 125MB was considered large for a language model”.

Actually, GPT-3's tokenizer is the same as GPT-2. https://datascience.stackexchange.com/a/109483

You had me questioning myself for a minute.

(The vocab size is still 50257. Even rounded up to a multiple of 128 for better sharding across the vocab embedding, only the first 50257 are used.)
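
The rounding mentioned above is simple to check. A minimal sketch (the helper name `pad_to_multiple` is made up for this example):

```python
def pad_to_multiple(n: int, multiple: int) -> int:
    """Round n up to the nearest multiple of `multiple`."""
    return -(-n // multiple) * multiple   # ceiling division, then scale back


VOCAB_SIZE = 50257
padded = pad_to_multiple(VOCAB_SIZE, 128)
print(padded, padded - VOCAB_SIZE)        # 50304 47
```

So a padded embedding table would have 50304 rows, of which the last 47 are never used.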

Believe it or not, 125M was large at the start of the GPT-2 era. No one knew LLMs could do anything interesting, let alone that they'd change the world.

I think yes, but more precisely the tokens were chosen to optimize training on a dataset that's biased to English content.

I am curious how the token set affects quality of responses, ignoring the factors related to token count mentioned in the post (cost, prompt expressivity, latency, etc)

Is it always better for the token set to be "native" to the majority of the training dataset and prompts/completions, or is it possible there's some "intermediate representation" (in compiler terms) that would be better?

I don't know what you mean by compiler terms but basically, worse tokenizer = worse LM performance. This is because worse tokenizer means more tokens per sentence so it takes more FLOPs to train on each sentence, on average. So given a fixed training budget, English essentially gets more "learning per token" than other languages.
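
To put made-up numbers on that argument: the budget and the tokens-per-sentence figures below are purely hypothetical, chosen only to show the arithmetic, and `sentences_seen` is a name invented for this sketch.

```python
def sentences_seen(token_budget: int, tokens_per_sentence: float) -> float:
    """How many sentences fit in a fixed training budget measured in tokens."""
    return token_budget / tokens_per_sentence


BUDGET = 1_000_000                     # hypothetical token budget per language
english = sentences_seen(BUDGET, 10)   # hypothetical: 10 tokens per sentence
other = sentences_seen(BUDGET, 50)     # hypothetical: 5x more tokens per sentence

print(english, other, english / other)
```

Under these assumptions, the same budget covers five times as many English sentences as sentences in the less efficiently tokenized language.
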
Data used to train the tokenizer is entirely separate from data training the LLM.

The tokenizer used to train GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.

GPT-4's tokenizer is already far more efficient though still weighted to English.

You can test it here https://tiktokenizer.vercel.app/

Should the cost really be 15x? Or even 5x? In this case, it's not even a question of whether the network is better at English, it's that the cost to communicate with it at all in other languages is higher. Once you pay that cost you now have to deal with the network potentially generating lower quality results for prompts in non-English languages too, which raises the actual cost of doing something with GPT beyond 15x since you probably will need more attempts.
Because there's so much more English language for them to train on relative to most other languages, they're able to do some optimizations for English that they can't elsewhere. Should they not be able to implement optimizations for cases where they have the data volume to do so?
Both of you are kind of misunderstanding a few things. Data used to train the tokenizer is entirely separate from data training the LLM.

The tokenizer used to train GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.

GPT-4's tokenizer is already far more efficient though still weighted to English.

> GPT-4's tokenizer is already far more efficient though still weighted to English.

Right. It's a general question. Should they be allowed to take the kinds of optimizations they can with tokenization when it's a function of how much data they can use, even if that means some languages get more optimization than others? Or should users of those languages that could be optimized effectively pay a tax out of some sense of fairness?

"there's so much more English language for them to train on relative to most other languages" is an interesting assertion. There are billions of people on earth speaking languages other than English and they have access to the internet. Are you sure it's not just the case that we didn't scrape that data?

Everyone has to choose what data to train on, you can't train against The Entire Internet, it's a limitless amount of data. But it becomes an intentional choice with consequences, like the 15.77x seen here.

> Everyone has to choose what data to train on, you can't train against The Entire Internet, it's a limitless amount of data.

Isn't that exactly how OpenAI managed to 10x GPT 3.5 with GPT 4.0?

but training against the entire Internet would still be biased towards English because English is the dominant language used on the Internet.
What makes you think there's a "should"?
There's always a should. Society gets a say in what people and corporations can and can't do in (at the very least) the form of laws. There's your should right there.
Which society?

I mean, OpenAI is a US company, so it's unsurprisingly going to communicate mostly in English.

Are we counting all societies? If so, should our software cater to all their demands, linguistic and/or cultural?

The article only clarifies that the dataset used to train the tokenizer is biased, not the entire dataset used by the GPT model.