by fzimmermann89
289 days ago
How foreign is the language? Was it likely included in pretraining to some degree? Does it use grammar, syllables, and logic similar to one of the large languages?
Your approach assumes there is an easy-to-learn mapping between context in your target language and concepts in a pretrained LLM. Can you get more text written in the low-resource language? Are you OK to share the name of the language?
|
The language is Hasidic Yiddish (which is by now different enough from YIVO Yiddish to almost be considered a different language). The amount of Yiddish of any kind included in pretraining is probably very small, but not zero. Also, it's a Germanic language with Hebrew script and roots, and some Slavic roots and suffixes. Most concepts and structure are probably not *very* foreign to a good model.
As I wrote in another comment, I have thought about initializing the new embeddings based on equivalent tokens in the old ones (e.g. by translating a token to English and finding the closest old token), but I'm starting to rethink the feasibility.
I will probably get more text sometime in the future, but I have to build the first version now.
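
To make the embedding-initialization idea concrete, here is a minimal sketch of one variant: each new token is seeded with the average of the old-vocabulary embeddings of its English translation, and tokens without a usable translation fall back to the mean old embedding plus a little noise. This is an assumption-laden illustration, not something I have validated. The function name `init_new_embeddings`, the `translations` dictionary, and the toy vocabulary are all hypothetical, and averaging is swapped in for the "find the closest old token" lookup described above.

```python
import numpy as np


def init_new_embeddings(old_emb, old_vocab, new_tokens, translations, rng=None):
    """Seed embeddings for new tokens from an existing embedding matrix.

    old_emb:      (V_old, d) array of pretrained input embeddings.
    old_vocab:    dict mapping old token string -> row index in old_emb.
    new_tokens:   list of new token strings to initialize.
    translations: hypothetical dict mapping a new token to a list of old
                  tokens (e.g. its English translation split by the old
                  tokenizer). Missing entries are allowed.
    """
    if rng is None:
        rng = np.random.default_rng(0)

    mean = old_emb.mean(axis=0)
    std = old_emb.std(axis=0)
    new_emb = np.empty((len(new_tokens), old_emb.shape[1]), dtype=old_emb.dtype)

    for i, tok in enumerate(new_tokens):
        # Keep only translation pieces that actually exist in the old vocabulary.
        ids = [old_vocab[t] for t in translations.get(tok, []) if t in old_vocab]
        if ids:
            # Average the embeddings of the translated pieces.
            new_emb[i] = old_emb[ids].mean(axis=0)
        else:
            # No usable translation: start near the centroid of the old embeddings.
            noise = rng.standard_normal(old_emb.shape[1])
            new_emb[i] = mean + 0.01 * std * noise
    return new_emb


# Toy example with made-up 2-d embeddings (illustrative values only).
old_vocab = {"house": 0, "water": 1, "the": 2}
old_emb = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_tokens = ["הויז", "וואַסער", "xyz"]
translations = {"הויז": ["house"], "וואַסער": ["water"]}

new_emb = init_new_embeddings(old_emb, old_vocab, new_tokens, translations)
```

The resulting rows would then be appended to the model's embedding matrix before fine-tuning. Whether this beats plain random or mean initialization for Hasidic Yiddish is an open question, and it depends heavily on how good the token-level translations are.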