by topce
97 days ago
Very interesting... I had a similar idea: train an LLM in Serbian, and even create a new encoding, https://github.com/topce/YUTF-8, inspired by YUSCII.
I did not have the time or the money ;-)
Great that you succeeded.
The idea: if the model is trained on Serbian text encoded in YUTF-8 (not UTF-8), a prompt in Serbian should need fewer tokens than the same prompt in English. Also, Serbian Cyrillic characters are 1 byte in YUTF-8 instead of 2 in UTF-8. Serbian is a phonetic language, so we never ask how a word is spelled. It is written in both Latin and Cyrillic letters.
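To make the byte-count point concrete, here is a minimal Python sketch. It only measures standard UTF-8; it does not implement YUTF-8, whose actual byte mapping is not reproduced here. The "one byte per letter" figure is simply the claim above taken at face value, and the sample word and the `utf8_len` helper are illustrative assumptions.

```python
def utf8_len(text: str) -> int:
    """Return the number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


# Illustrative sample: the same greeting in both Serbian scripts.
cyrillic = "Здраво"
latin = "Zdravo"

cyrillic_bytes = utf8_len(cyrillic)  # each Cyrillic letter takes 2 bytes in UTF-8
latin_bytes = utf8_len(latin)        # plain ASCII letters take 1 byte each

# Hypothetical estimate: one byte per letter, as claimed for YUTF-8 above.
# This is NOT computed with the real YUTF-8 mapping.
one_byte_estimate = len(cyrillic)

print(f"UTF-8, Cyrillic: {cyrillic_bytes} bytes")
print(f"UTF-8, Latin:    {latin_bytes} bytes")
print(f"1 byte/letter estimate for Cyrillic: {one_byte_estimate} bytes")
```

Whether fewer bytes actually turns into fewer tokens would depend on the tokenizer being trained on text in that encoding, so this only illustrates the starting point of the idea.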
https://huggingface.co/spaces/lapa-llm/lapa
Best tokenizer for the Ukrainian language
Thanks to a SOTA method for tokenizer adaptation developed by Mykola Haltiuk as part of this project, it was possible to replace 80,000 tokens out of 250,000 with Ukrainian ones without loss of model quality, thus making Lapa LLM the fastest model for working with the Ukrainian language. Compared to the original Gemma 3, for working with Ukrainian, the model requires 1.5 times fewer tokens, thus performing three times fewer computations to achieve better results.