by famouswaffles 1161 days ago
The data used to train the tokenizer is entirely separate from the data used to train the LLM.

The tokenizer GPT-3 used was old, inefficient, and targeted at tokenizing English. That's pretty much all there is to it. It's possible to train a tokenizer that is more efficient and more inclusive of other languages.
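
To make that concrete, here is a rough toy sketch of byte-pair-style merge training in plain Python. It is not GPT-3's or GPT-4's actual tokenizer; the function names (`train_bpe`, `merge_pair`, `tokenize`) and the tiny corpora are made up for illustration, and it splits on whitespace and works on characters rather than bytes. The only point is that the merges a tokenizer learns depend on the corpus it was trained on, so an English-only corpus leaves other languages fragmented while a mixed corpus does not.

```python
from collections import Counter


def merge_pair(symbols: list[str], pair: tuple[str, str]) -> list[str]:
    """Replace every adjacent occurrence of `pair` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out


def train_bpe(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules, most frequent pair first."""
    # Each distinct word starts as a tuple of single characters.
    vocab = {tuple(word): count for word, count in Counter(corpus.split()).items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:  # every word is already a single symbol
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {tuple(merge_pair(list(s), best)): c for s, c in vocab.items()}
    return merges


def tokenize(text: str, merges: list[tuple[str, str]]) -> list[str]:
    """Split on whitespace, then apply the learned merges in training order."""
    tokens = []
    for word in text.split():
        symbols = list(word)
        for pair in merges:
            symbols = merge_pair(symbols, pair)
        tokens.extend(symbols)
    return tokens


# Hypothetical toy corpora, used only to train the tokenizer (no model involved).
english_corpus = "the cat sat on the mat the dog sat on the log"
spanish_corpus = "el gato duerme en la alfombra el perro duerme en la cama"

english_only = train_bpe(english_corpus, num_merges=100)
mixed = train_bpe(english_corpus + " " + spanish_corpus, num_merges=100)

english = "the cat sat on the mat"
spanish = "el gato duerme en la alfombra"

for name, merges in [("english-only", english_only), ("mixed", mixed)]:
    print(name, "| English tokens:", len(tokenize(english, merges)),
          "| Spanish tokens:", len(tokenize(spanish, merges)))
```

Both test sentences have six words. With the English-only merges the English sentence collapses to one token per word while the Spanish one is split into many more pieces; with the mixed merges both compress equally well. Real tokenizers train on far larger corpora with a fixed vocabulary budget, so the tradeoff is less clean than in this toy.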

GPT-4's tokenizer is already far more efficient, though still weighted toward English.

You can test it here: https://tiktokenizer.vercel.app/