by kgeist
23 days ago
>but there remains a seemingly obvious use case for non-latin languages to do things from scratch
>see sarvam.ai and their tokenisation improvements on local languages

You don't need to build from scratch to improve tokenization, though. Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
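
To illustrate why more language-specific tokens can speed up generation: decoding emits one token per step, so fewer tokens for the same text means fewer steps. Here is a toy sketch, not T-Bank's actual method. The greedy longest-match tokenizer and both vocabularies are made up for illustration, and real tokenizers (BPE with byte fallback) behave differently in the details.

```python
def greedy_tokenize(text, vocab):
    """Longest-match-first tokenization; unknown text falls back to single characters."""
    max_len = max(len(t) for t in vocab)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, shrinking down to one character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabularies: a Latin-centric base, then the same base
# extended with a few Cyrillic subwords.
base_vocab = {"the", "ing", " "}
extended_vocab = base_vocab | {"при", "вет", "мир", "привет"}

text = "привет мир"
base_tokens = greedy_tokenize(text, base_vocab)
ext_tokens = greedy_tokenize(text, extended_vocab)

# Rough proxy for the reduction in decoding steps on this one string.
ratio = len(base_tokens) / len(ext_tokens)
print(len(base_tokens), len(ext_tokens), round(ratio, 2))
```

The ratio on a toy string says nothing about real-world speedups. The same tokens-per-text comparison on a real corpus is how you would measure the effect for a given language, and the new embeddings still need training before the model can use them.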
the great thing about the current momentum is that someone can test this hypothesis by applying the T-Bank approach to the same set of languages and comparing outcomes.
unfortunately, not everyone has a respectable amount of compute readily available, at least not those outside the ZIRP/VC ecosystem of the valley.