by rldjbpin
21 days ago
may not be the most efficient way to go about things, but there remains a seemingly obvious use case for non-latin languages to do things from scratch. see sarvam.ai and their tokenisation improvements on local languages [1]. not every llm needs to help with coding, nor does it need to be a Babel fish already. language is culture, so i can see the motivation behind their initiative. it must be nice to be able to afford to do this yourself.
[1] https://www.sarvam.ai/blogs/sarvam-30b-105b
>see sarvam.ai and their tokenisation improvements on local languages
You don't need to build from scratch to improve tokenization, though.
Russia's T-Bank was able to increase generation speeds by 1.5-3x by changing a stock Qwen's tokenizer to include 5 times more Cyrillic tokens (+ post-training on a Russian corpus).
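To illustrate why extra in-language tokens can speed up generation, here is a toy sketch (not T-Bank's or Qwen's actual tokenizer). It assumes a greedy longest-match tokenizer with UTF-8 byte fallback; `greedy_tokenize` and both vocabularies are made up for the example. Each generated token costs one decoding step, so fewer tokens for the same text means fewer steps.

```python
def greedy_tokenize(text, vocab):
    """Toy longest-match tokenizer; characters not covered fall back to UTF-8 bytes."""
    max_len = max(len(t) for t in vocab)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in vocab:
                tokens.append(piece)
                i += size
                break
        else:
            # Byte fallback: one token per UTF-8 byte of the current character.
            tokens.extend(f"<0x{b:02X}>" for b in text[i].encode("utf-8"))
            i += 1
    return tokens


# Hypothetical vocabularies: the base one has no Cyrillic entries at all.
base_vocab = {"hello", " ", "world"}
extended_vocab = base_vocab | {"привет", "мир"}

text = "привет мир"
base = greedy_tokenize(text, base_vocab)
ext = greedy_tokenize(text, extended_vocab)

print(len(base), len(ext))  # 19 3
```

The ratio here is exaggerated by the toy vocabulary and says nothing about real-world numbers. In a real model the new tokens also need embeddings, which is presumably where the post-training on a Russian corpus comes in.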