by kgeist 1108 days ago
For Russian text, it degrades to basically 1 character = 1 token, due to the tokenization issues discussed in the article, yet it still produces perfectly coherent text, almost the same as in English. In my tests, though, its Russian output is worse than its English output, something like 80% of the quality, I'd say. I'm not an LLM expert, but I have a theory: being trained mostly on English text, its thought processes actually happen in English (in the part of the model that was trained on English text), and for Russian it maps English to Russian and back using its translation ability. I say this because I've noticed it sometimes produces slightly awkward sentences whose word choice makes sense in English (calques?) but not as much in Russian.
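A rough sketch of why Russian starts from a worse position, assuming the tokenizer is byte-level (works on UTF-8 bytes) and has learned few Cyrillic merges. That is an assumption, not something measured against a real tokenizer here. The helper name `utf8_bytes_per_char` and the sample strings are made up for illustration.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per non-whitespace character."""
    chars = [c for c in text if not c.isspace()]
    return sum(len(c.encode("utf-8")) for c in chars) / len(chars)


# Hypothetical sample strings, chosen only for illustration.
english = "hello world"
russian = "привет мир"

en_ratio = utf8_bytes_per_char(english)  # ASCII letters take 1 byte each
ru_ratio = utf8_bytes_per_char(russian)  # Cyrillic letters take 2 bytes each

print(f"English: {en_ratio} bytes/char, Russian: {ru_ratio} bytes/char")
```

So before any merges are applied, a Cyrillic letter is already two bytes, and a vocabulary built mostly from English text has few merges to compress it further. This only shows the byte counts, not what any particular model's tokenizer actually does.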
1 comment

I think the same. LLMs are actually sort of "multi-linguas": they can transform input in any language into an internal representation and then produce output in some other language, thanks to the many layers of neurons inside them.