by kgeist
1108 days ago
For Russian text, it degrades to basically 1 character = 1 token, due to the tokenization issues discussed in the article, yet it still produces perfectly coherent text, almost as good as in English. In my tests its Russian output is somewhat worse than its English output, though: something like 80% of the quality, I'd say. I'm not an LLM expert, but I have a theory: being trained mostly on English text, its thought processes actually happen in English (in the part of the model that was trained on English text), and for Russian it maps to English and back using its translation ability. I say this because I've noticed it sometimes produces slightly awkward sentences whose word choice makes sense in English (calques?) but not so much in Russian.
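
If anyone wants to check the "1 character = 1 token" figure themselves, here is a rough sketch of how the ratio could be measured. Note that `tokens_per_char` and `utf8_byte_tokenize` are names I made up for illustration, and the byte-per-token tokenizer is only a stand-in assumption, not the model's real tokenizer; to test the actual claim you would pass in the real tokenizer's encode function instead.

```python
from typing import Callable, List


def tokens_per_char(text: str, tokenize: Callable[[str], List[int]]) -> float:
    """Average number of tokens produced per character of `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(tokenize(text)) / len(text)


def utf8_byte_tokenize(text: str) -> List[int]:
    # Hypothetical stand-in tokenizer: one token per UTF-8 byte.
    # This is an assumption for illustration, NOT a real model tokenizer.
    return list(text.encode("utf-8"))


english = "hello world"
russian = "привет мир"

# ASCII letters are 1 byte each in UTF-8; Cyrillic letters are 2 bytes each.
en_ratio = tokens_per_char(english, utf8_byte_tokenize)
ru_ratio = tokens_per_char(russian, utf8_byte_tokenize)

print(f"English: {en_ratio:.2f} tokens/char")
print(f"Russian: {ru_ratio:.2f} tokens/char")
```

With this stand-in the Russian string costs more tokens per character simply because Cyrillic takes two bytes per letter in UTF-8. A real tokenizer merges frequent byte sequences, so its numbers will differ; the point is only the measurement, which stays the same once a real encode function is swapped in.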