by PufPufPuf 1161 days ago
It's funny that you're calling English "effective" because it has shorter words. Word length has nothing to do with how effectively a language tokenizes: if a long word is frequent enough in the training data, it becomes a single token. That's the point of doing tokenization instead of feeding raw bytes into the model.
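
To make that concrete, here is a toy byte-pair-encoding sketch. It is not any particular model's tokenizer: the corpus, the function names (`train_bpe`, `apply_merge`, `tokenize`) and the merge budget are all made up for illustration, and real tokenizers work on bytes with pre-tokenization rules. It only shows the mechanism: a frequent 12-letter word collapses into one token, while a rare 3-letter word stays as three.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges by repeatedly fusing the most frequent adjacent pair."""
    # Start with every word split into single characters.
    vocab = {tuple(word): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Rewrite every word using the newly learned merge.
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged = apply_merge(symbols, best)
            new_vocab[merged] = new_vocab.get(merged, 0) + freq
        vocab = new_vocab
    return merges


def tokenize(word, merges):
    """Apply the learned merges, in training order, to a single word."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Hypothetical corpus: one long but very frequent word, one short but rare word.
corpus = {"tokenization": 1000, "zyx": 1}
merges = train_bpe(corpus, num_merges=11)

print(tokenize("tokenization", merges))  # ['tokenization']
print(tokenize("zyx", merges))           # ['z', 'y', 'x']
```

The merge budget goes to whatever pairs are frequent, so the long word ends up cheaper than the short one.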

BTW, English might have shorter words than many languages, but its sentences get wordier. For example, English "die" is shorter than Czech "umřít", but the sentence "We are going to die." is much longer than its Czech equivalent, "Umřeme."