by famouswaffles
1161 days ago
The data used to train the tokenizer is entirely separate from the data used to train the LLM. The tokenizer used for GPT-3 was old, inefficient, and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages. GPT-4's tokenizer is already far more efficient, though still weighted toward English. You can test it here:
https://tiktokenizer.vercel.app/
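
To make the point concrete, here's a rough sketch of a toy character-level BPE trainer in Python. This is not the GPT-3 or GPT-4 tokenizer (those work on bytes and are trained on far more data); the function names and the tiny corpus are made up for illustration. The idea it shows is just that the merges a tokenizer learns depend only on its own training text, so text in a language it never saw falls back to one token per character.

```python
from collections import Counter


def apply_merge(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merges from whitespace-split words (toy version)."""
    words = [list(w) for w in corpus.split()]
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pairs = Counter()
        for w in words:
            for a, b in zip(w, w[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        words = [apply_merge(w, best) for w in words]
    return merges


def tokenize(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Tokenize `text` by replaying the learned merges in order, word by word."""
    tokens = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = apply_merge(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Hypothetical English-only training text for the tokenizer.
english_corpus = "the cat sat on the mat the dog sat on the log " * 10
merges = train_bpe(english_corpus, num_merges=20)

english_text = "the cat sat on the mat"
greek_text = "καλημέρα κόσμε"

print(len(tokenize(english_text, merges)), "tokens for", repr(english_text))
print(len(tokenize(greek_text, merges)), "tokens for", repr(greek_text))
```

The English sentence compresses to fewer tokens than it has characters, while the Greek one gets no compression at all, because none of the learned merges apply to it. Training the same code on a mixed-language corpus would change that without touching any model.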