Here’s a quick sanity check before you embark on the real thing:

- Tokenize your entire corpus with a few off-the-shelf multilingual tokenizers (Llama, Qwen, Gemma) and calculate the ratio of characters to tokens. The higher the better, ideally in the 3-5 range.

- Manually produce or select sentences that are similar in meaning but not in writing (embedding models also leverage graphemic overlap, not just semantic similarity), then check whether similar sentences consistently show higher cosine similarity than dissimilar ones. This test is for embedding models like XLM-RoBERTa, not for LLMs, but it gives a similar kind of insight.
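The first check boils down to a few lines. Here's a rough sketch, assuming you pass in any function that turns a string into a list of tokens; the name `chars_per_token` and the `tokenize` parameter are made up for illustration.

```python
from typing import Callable, Iterable, Sequence


def chars_per_token(texts: Iterable[str], tokenize: Callable[[str], Sequence]) -> float:
    """Total characters divided by total tokens across the corpus.

    `tokenize` is a hypothetical hook: any callable mapping a string to a
    sequence of tokens (or token ids) works.
    """
    total_chars = 0
    total_tokens = 0
    for text in texts:
        # len() counts Unicode code points, which is only an approximation
        # of "letters" for scripts that rely on combining marks.
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens
```

With Hugging Face `transformers` you could plug in something like `lambda t: tok.encode(t, add_special_tokens=False)`, where `tok = AutoTokenizer.from_pretrained(...)` is loaded from whichever checkpoints you're comparing. Run it once per tokenizer and compare the numbers.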

If both tests look promising, you likely don’t need a custom tokenizer or embedding model.
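For the second check, something along these lines would do. It's a sketch under assumptions: `embed` is a hypothetical hook for whatever embedding model you're testing (it just has to return a vector per sentence), and `similarity_gap` is a made-up helper name. The "win rate" is one possible way to read "consistently higher", not the only one.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Pair = Tuple[str, str]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("cannot compare a zero vector")
    return dot / norm


def similarity_gap(
    embed: Callable[[str], Vector],
    similar_pairs: List[Pair],
    dissimilar_pairs: List[Pair],
) -> Tuple[float, float, float]:
    """Return (mean similar score, mean dissimilar score, win rate).

    Win rate is the fraction of (similar, dissimilar) pair combinations in
    which the similar pair scores strictly higher.
    """
    if not similar_pairs or not dissimilar_pairs:
        raise ValueError("need at least one pair of each kind")
    sim = [cosine(embed(a), embed(b)) for a, b in similar_pairs]
    dis = [cosine(embed(a), embed(b)) for a, b in dissimilar_pairs]
    wins = sum(1 for s in sim for d in dis if s > d)
    return (
        sum(sim) / len(sim),
        sum(dis) / len(dis),
        wins / (len(sim) * len(dis)),
    )
```

A win rate close to 1.0 on pairs that share meaning but not spelling suggests the model is picking up semantics rather than just surface overlap.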

Personally, if I were you, I would just take Qwen 0.6B Base (not Instruct, since you want text completion) and continue pretraining it on the data you have. It is very likely to work decently out of the box.