by nl
247 days ago
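Early tokenizers were often one character or one whole word per token. Then Sennrich et al. applied subword encoding (byte-pair encoding) [1] to neural machine translation and found it worked much better, particularly for rare words; Google used a similar subword scheme in its early neural translation work. Subword units are genuinely meaningful in most languages. You do need to tune the vocabulary size though, since it trades sequence length against how well rare words are covered. [1] https://aclanthology.org/P16-1162/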
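
To make the idea concrete, here is a minimal sketch of byte-pair-encoding-style merge learning on a toy word list. It is not the reference implementation from [1]: the function names (`learn_bpe`, `merge_pair`, `encode`), the `</w>` end-of-word marker, and the toy corpus are assumptions chosen for illustration. The `num_merges` argument is the knob that controls vocabulary size.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (hypothetical helper)."""
    # Start from characters; "</w>" marks the end of a word (an assumed convention).
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the new merge rule to every word in the working vocabulary.
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def encode(word, merges):
    """Split `word` into subword tokens by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


if __name__ == "__main__":
    corpus = ["low", "lower", "lowest", "newer", "wider"]
    merges = learn_bpe(corpus, num_merges=10)
    print(encode("lowest", merges))
```

With `num_merges=0` every word falls back to single characters; raising it yields longer subword units and shorter token sequences, which is the vocabulary-size trade-off mentioned above.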