by _jayhack_ 20 days ago
For some definitions of better, yes. Chinese is more token-efficient for representing a given text, for example, although this does not always lead to better performance on downstream tasks.

True. I suspect it's still hard to tell whether the bottleneck is the language itself, the tokenizer, or just the overwhelming amount of English training data.