by 29athrowaway
1168 days ago
It is not that tokenization is optimized for English, but perhaps the other way around. Take "lámpara" or "pantalones" in Spanish, for example. English speakers were clever enough to shorten those words to "lamp" and "pants" respectively, and they have done this with many words. Translate text into Spanish and you will see that the text gets longer and that more meaning is encoded into each word. "La mesa" marks the table as grammatically feminine, although tables are not lifeforms and have no sex. To me, some languages impose a communication tax. Saying so is taboo because people conflate language with culture.

BTW, English might have shorter words than many languages, but the sentences get wordier. For example, English "die" is shorter than Czech "umřít", but the sentence "We are going to die." is much longer than "Umřeme." in Czech.
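
For anyone who wants to check the numbers, here is a rough sketch that counts words, characters, and UTF-8 bytes for the examples above. The helper name `measure` is made up for illustration, and none of these counts are token counts; a real tokenizer would likely give different numbers, so treat this only as a quick sanity check on the "shorter words, wordier sentences" point.

```python
import unicodedata


def measure(text: str) -> dict:
    """Return simple length measures for a piece of text (hypothetical helper)."""
    # Normalize to NFC so accented letters such as "ř" count as one character
    # regardless of how the string was typed or pasted.
    normalized = unicodedata.normalize("NFC", text)
    return {
        "words": len(normalized.split()),               # whitespace-separated words
        "chars": len(normalized),                       # Unicode code points
        "utf8_bytes": len(normalized.encode("utf-8")),  # size when encoded as UTF-8
    }


for sample in ["die", "umřít", "We are going to die.", "Umřeme."]:
    print(f"{sample!r}: {measure(sample)}")
```

By these measures the English word is shorter (3 characters against 5), while the English sentence is longer (5 words and 20 characters against 1 word and 7 characters).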