	by philomath868 297 days ago
	Perhaps. But I don't think there is an existing (open weights) model that really knows YIVO Yiddish, either, so what should I base this fine-tuning on?

4 comments

yorwba 297 days ago

You might be able to start with German, since German-Yiddish cognates tend to have fairly regular spelling correspondences (not exactly one-to-one, but often few-to-one).

So, given a Latin-script token from a model that does OK in German (bonus points if it also does Hebrew), generate several candidate Hebrew-script tokens with some regex search-and-replace. Then use the resulting candidate vocabulary to tokenize your Yiddish corpus, and for each original token keep the candidate replacement that was used most often in that tokenization.

This vocabulary replacement should give you a model that does OK in German-in-Hebrew-script. I think that would be a better base for a Yiddish model than training from scratch, but of course that's just a hunch that might turn out to be wrong.
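
A rough sketch of the vocabulary-replacement idea above, in Python. Everything here is an assumption for illustration: the `RULES` table is a toy set of Latin-to-Hebrew-script correspondences (not a vetted German-Yiddish mapping), the greedy longest-match tokenizer stands in for whatever tokenizer the real model uses, and the function names are made up. It ignores diacritics and does not handle two original tokens that produce the same candidate.

```python
import re
from collections import Counter
from itertools import product

# Toy Latin -> Hebrew-script correspondences (illustrative assumptions only).
# Each Latin piece maps to one or more candidate spellings ("few-to-one").
RULES = {
    "sch": ["ש"],
    "st": ["שט", "סט"],
    "au": ["וי", "או"],
    "ei": ["יי", "אי"],
    "h": ["ה"],
    "s": ["ס", "ז"],
    "n": ["נ", "ן"],
}
# Longest pieces first, so "sch" wins over "s" at the same position.
_PIECE = re.compile("|".join(sorted(map(re.escape, RULES), key=len, reverse=True)))


def candidates(token):
    """Return every Hebrew-script spelling the rules allow for a Latin token."""
    pieces = _PIECE.findall(token)
    if not token or "".join(pieces) != token:
        return []  # some character has no rule; leave this token alone
    return ["".join(combo) for combo in product(*(RULES[p] for p in pieces))]


def greedy_tokenize(text, vocab):
    """Greedy longest-match tokenization; characters outside vocab are skipped."""
    if not vocab:
        return []
    max_len = max(map(len, vocab))
    out, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                out.append(text[i:j])
                i = j
                break
        else:
            i += 1  # no candidate starts here
    return out


def choose_replacements(latin_tokens, corpus):
    """For each Latin token, keep the candidate used most often in the corpus."""
    cands = {t: candidates(t) for t in latin_tokens}
    vocab = {c for cs in cands.values() for c in cs}
    counts = Counter(greedy_tokenize(corpus, vocab))
    # max() returns the first candidate on ties, including when none was used.
    return {t: max(cs, key=lambda c: counts[c]) for t, cs in cands.items() if cs}
```

Calling `choose_replacements(["haus", "stein"], corpus)` on a Yiddish corpus returns a dict from each original token to its chosen Hebrew-script replacement; in a real run the token list would come from the German model's vocabulary, and the chosen strings would replace the corresponding vocabulary entries while the embeddings stay in place.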


bc569a80a344f9c 297 days ago

Qwen3 lists Eastern Yiddish (presumably YIVO) as one of the 119 training languages. It’s available at various sizes including rather small ones to experiment with cheaply, and has good documentation for suggested fine-tuning pipelines. I’d start with that.


bc569a80a344f9c 292 days ago

If you’re still looking at it, there’s a new open-weights model that focuses on multilinguality: https://news.ycombinator.com/item?id=45108401


agentcoops 297 days ago

For a similar project, I worked with GPT to create an extensive dataset of translations from a historical language. I could then use it in two ways: to evaluate the baseline capability of other models in the language (give the model the task of translating the various passages, then have GPT grade the results), and as fine-tuning data.
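
A minimal sketch of that evaluation loop, assuming the dataset is a list of (source, reference translation) pairs. The `translate` and `judge` callables are hypothetical placeholders: in practice `translate` would wrap the model under test and `judge` would wrap a call to a stronger model that returns a numeric score. The `prompt`/`completion` record layout is likewise just one assumed shape for fine-tuning data.

```python
from statistics import mean


def evaluate_translator(pairs, translate, judge):
    """Average the judge's score over every (source, reference) pair.

    translate(source) -> candidate translation       (hypothetical callable)
    judge(source, reference, candidate) -> float     (hypothetical callable)
    """
    scores = [judge(src, ref, translate(src)) for src, ref in pairs]
    if not scores:
        raise ValueError("no translation pairs to evaluate")
    return mean(scores)


def to_finetune_records(pairs):
    """Reuse the same pairs as prompt/completion records for fine-tuning."""
    return [{"prompt": src, "completion": ref} for src, ref in pairs]
```

Keeping the judge behind a plain callable makes it easy to compare several candidate base models on the same passages before committing to one for fine-tuning.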
