by philomath868
298 days ago
Thank you! The language is Hasidic Yiddish (which is by now different enough from YIVO Yiddish to almost be considered a different language). The amount of Yiddish (of any kind) included in pretraining is probably very small, but not zero. Also, it's a Germanic language with Hebrew script and roots, plus some Slavic roots and suffixes, so most of its concepts and structure are probably not *very* foreign to a good model. As I wrote in another comment, I have thought about initializing the new embeddings from equivalent tokens in the old vocabulary (e.g. by translating a token to English and finding the closest old token), but I'm starting to doubt how feasible that is. I will probably get more text sometime in the future, but I have to build the first version now.
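For concreteness, here is a rough sketch of the initialization idea mentioned above, not something I have tested on a real model. It assumes a hypothetical `translate` callable that maps a new token to an English gloss, and it simplifies "closest old token" to an exact match of the gloss words against the old vocabulary. The toy vocabulary, vectors, and glossary are made up for illustration.

```python
import random

def init_new_embeddings(old_vocab, old_emb, new_tokens, translate,
                        noise_std=0.01, seed=0):
    """Build initial vectors for new tokens from old-token embeddings.

    old_vocab : dict mapping old token -> row index in old_emb
    old_emb   : list of equal-length float lists (the old embedding matrix)
    translate : hypothetical callable, new token -> English gloss or None
    """
    rng = random.Random(seed)
    dim = len(old_emb[0])
    # Fallback for tokens with no usable gloss: mean of all old embeddings.
    mean = [sum(row[j] for row in old_emb) / len(old_emb) for j in range(dim)]

    result = {}
    for tok in new_tokens:
        gloss = translate(tok) or ""
        # Simplification: keep only gloss words that exist verbatim in old_vocab.
        rows = [old_emb[old_vocab[w]] for w in gloss.split() if w in old_vocab]
        if rows:
            base = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
        else:
            base = list(mean)
        # Small noise so tokens sharing a gloss do not start out identical.
        result[tok] = [v + rng.gauss(0.0, noise_std) for v in base]
    return result

# Toy data (assumed, for illustration only).
old_vocab = {"house": 0, "water": 1, "good": 2}
old_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_glossary = {"hoyz": "house", "vaser": "water", "gut-vaser": "good water"}

init = init_new_embeddings(old_vocab, old_emb,
                           ["hoyz", "gut-vaser", "unknown"],
                           toy_glossary.get, noise_std=0.0)
```

With the noise turned off, a token with a one-word gloss simply copies that old row, a multi-word gloss averages the matching rows, and anything without a gloss starts at the mean of the old matrix. In a real setup the resulting vectors would be written into the rows added when the embedding matrix is resized.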
I ran some tests and, without fine-tuning, GPT can translate medieval German, for example, considerably better than well-known scholars today.