Hacker News new | ask | show | jobs
by viscanti 1161 days ago
Because there's so much more English language for them to train on relative to most other languages, they're able to do some optimizations for English that they can't elsewhere. Should they not be able to implement optimizations for cases where they have the data volume to do so?
2 comments

Both of you are kind of misunderstanding a few things. Data used to train the tokenizer is entirely separate from data training the LLM.

The tokenizer used to train GPT-3 was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more including of other languages.

GPT-4's tokenizer is already far more efficient though still weighted to English.

> GPT-4's tokenizer is already far more efficient though still weighted to English.

Right. It's a general question. Should they be allowed to take the kinds of optimizations they can with tokenization when it's a function of how much data they can use, even if that means some languages get more optimization than others? Or should users of those languages that could be optimized effectively pay a tax out of some sense of fairness?

"there's so much more English language for them to train on relative to most other languages" is an interesting assertion. There are billions of people on earth speaking languages other than English and they have access to the internet. Are you sure it's not just the case that we didn't scrape that data?

Everyone has to choose what data to train on, you can't train against The Entire Internet, it's a limitless amount of data. But it becomes an intentional choice with consequences, like the 15.77x seen here.

> Everyone has to choose what data to train on, you can't train against The Entire Internet, it's a limitless amount of data.

Isn't that exactly how OpenAI managed to 10x GPT 3.5 with GPT 4.0?

but training against the entire Internet would still be biased towards English because English is the dominant language used on the Internet.