by famouswaffles
1161 days ago
Both of you are kind of misunderstanding a few things. The data used to train the tokenizer is entirely separate from the data used to train the LLM. The tokenizer GPT-3 used was old, inefficient, and targeted at English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages. GPT-4's tokenizer is already far more efficient, though still weighted toward English.
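To make the point concrete, here's a toy sketch (not OpenAI's actual tokenizer or training data) of how a BPE-style tokenizer learns its merges from its own corpus, independent of whatever the model is later trained on. Everything here is an assumption for illustration: the function names `train_bpe`, `apply_merge`, and `tokenize` are made up, the corpora are tiny, and it works on characters, whereas real GPT tokenizers work on bytes.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    words = [list(w) for w in corpus]
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by the pair itself
        # so the result is deterministic.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        words = [apply_merge(w, best) for w in words]
    return merges


def tokenize(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols


# Hypothetical tokenizer-training corpora (NOT the LLM's training data).
english_corpus = ["hello", "world"]
mixed_corpus = english_corpus + ["привет", "мир"]

english_merges = train_bpe(english_corpus, num_merges=20)
mixed_merges = train_bpe(mixed_corpus, num_merges=20)

word = "привет"
print(len(tokenize(word, english_merges)))  # 6: no learned merge applies
print(len(tokenize(word, mixed_merges)))    # 1: the whole word was learned
```

The English-only tokenizer falls back to one token per character on the Russian word, while the one trained on the mixed corpus covers it in a single token. Same text, same model downstream, and the only thing that changed is the tokenizer's own training data.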