Hacker News
by famouswaffles 1168 days ago
Both of you are kind of misunderstanding a few things. The data used to train the tokenizer is entirely separate from the data used to train the LLM.

The tokenizer GPT-3 used was old, inefficient and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.

GPT-4's tokenizer is already far more efficient though still weighted to English.
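
To make the efficiency point concrete, here's a toy sketch. It is not OpenAI's tokenizer or training setup; the function names, the tiny corpus and the sample sentences are all made up for illustration. It trains a minimal byte-level BPE on English-only text, then compares tokens per character on English versus Russian input.

```python
from collections import Counter


def apply_merge(seq, pair):
    """Replace every adjacent occurrence of `pair` in `seq` with one merged token."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(seq[i] + seq[i + 1])
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn up to `num_merges` byte-pair merges from `corpus` (toy version)."""
    seq = [bytes([b]) for b in corpus.encode("utf-8")]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        best, count = pairs.most_common(1)[0]
        if count < 2:  # nothing repeats any more, so stop early
            break
        merges.append(best)
        seq = apply_merge(seq, best)
    return merges


def encode(text: str, merges) -> list[bytes]:
    """Tokenize `text` by replaying the learned merges in training order."""
    seq = [bytes([b]) for b in text.encode("utf-8")]
    for pair in merges:
        seq = apply_merge(seq, pair)
    return seq


def tokens_per_char(text: str, merges) -> float:
    """Lower is better: fewer tokens are needed per character of input."""
    return len(encode(text, merges)) / len(text)


# Hypothetical English-only corpus used to train the tokenizer.
english_corpus = "the quick brown fox jumps over the lazy dog. the dog sleeps. " * 20
merges = train_bpe(english_corpus, num_merges=50)

print("English:", tokens_per_char("the fox jumps over the dog", merges))
print("Russian:", tokens_per_char("лиса прыгает через собаку", merges))
```

Since none of the learned merges involve Cyrillic bytes, the Russian sentence falls back to one token per UTF-8 byte, while the English sentence is compressed by the merges. Training on a more multilingual corpus would narrow that gap.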

1 comment

> GPT-4's tokenizer is already far more efficient though still weighted to English.

Right. It's a general question. Should they be allowed to make whatever tokenization optimizations they can, given that those depend on how much data is available, even if that means some languages end up better optimized than others? Or should users of the languages that could be optimized effectively pay a tax out of some sense of fairness?