

by sebzim4500 775 days ago

On top of what everyone else has said, even if you could train your tokenizer on exactly your training dataset, it wouldn't remove all these issues.

The way BPE works, you can end up with very rare tokens when they get merged into another token. Imagine you have tokens X and Y, and it happens that almost every X is followed by Y. The BPE process would then make a new token XY but wouldn't remove the old token X from the vocabulary. X would now almost never appear in the tokenized data, so it would be undertrained.
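To make that concrete, here's a rough sketch of a single greedy BPE merge on a toy corpus. The corpus, the counts, and the helper names (`count_pairs`, `apply_merge`) are all made up for illustration; this is not how any particular tokenizer library implements it.

```python
from collections import Counter

def count_pairs(seqs):
    """Count adjacent token pairs across all sequences."""
    pairs = Counter()
    for seq in seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs

def apply_merge(seq, a, b):
    """Replace every adjacent (a, b) in seq with the merged token a + b."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

# Hypothetical toy corpus: "X" is followed by "Y" 99 times out of 100.
corpus = [["X", "Y"]] * 99 + [["X", "b"]] + [["a", "Y"]] * 50
vocab = {"X", "Y", "a", "b"}

# Greedy step: merge the most frequent pair and ADD it to the vocabulary.
(a, b), _ = count_pairs(corpus).most_common(1)[0]
vocab.add(a + b)  # nothing is ever removed from the vocabulary

merged = [apply_merge(seq, a, b) for seq in corpus]
token_counts = Counter(tok for seq in merged for tok in seq)

print(sorted(vocab))
print(token_counts)
```

After the merge, "X" is still in the vocabulary but shows up only once in the tokenized corpus, while "XY" shows up 99 times. A model trained on this data would see almost no examples of the standalone "X" token.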

I guess to solve this we'd need to use a more sophisticated merging algorithm than the greedy one.