by agentcoops 295 days ago
Not an answer to your original question, but I think you’d be surprised how much high-quality historical linguistic content has been hiding in the dusty old corners of the internet. I’ve been doing some work recently with LLMs on historical languages (various forms of Latin, Ancient Greek and medieval European languages), and the out-of-the-box performance of state-of-the-art LLMs is shockingly good. That is less surprising when you remember all the archive digitization projects of the early 2000s whose output ended up either behind stale links, preserved only by archive.org, or stored in arcane CRMs essentially unusable by humans. I assume the same is true, perhaps especially so, for various historical Yiddish corpora.

In the tests I ran, GPT, without any fine-tuning, translated medieval German, for example, considerably better than well-known scholars do today.

1 comment

Why would you throw out the original embedding layer? That seems like a step backwards to me. The model was likely trained partly on Yiddish, and the rest of the network is tuned to those embeddings, so discarding them throws away a lot of the information the rest of the model relies on.
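
To make the alternative concrete, here is a minimal sketch of extending an embedding table instead of replacing it. It is not the original poster's setup: the function name `extend_embeddings`, the toy 2-dimensional table, and the choice to initialize new rows near the mean of the existing ones are all assumptions for illustration. The point is only that the pretrained rows survive untouched while rows for new tokens are appended.

```python
import random


def extend_embeddings(old_rows, n_new, noise=0.01, seed=0):
    """Keep every original embedding row and append n_new fresh rows.

    Hypothetical helper: new rows start near the column-wise mean of the
    existing rows, which is one common heuristic, not the only option.
    """
    rng = random.Random(seed)
    dim = len(old_rows[0])
    # Column-wise mean of the existing (pretrained) embeddings.
    mean = [sum(row[j] for row in old_rows) / len(old_rows) for j in range(dim)]
    # New rows: the mean plus a little Gaussian noise so they are not identical.
    new_rows = [[m + rng.gauss(0.0, noise) for m in mean] for _ in range(n_new)]
    # Copy the old rows so the caller's table is not mutated.
    return [list(row) for row in old_rows] + new_rows


# Toy "pretrained" table with 3 tokens of dimension 2 (made-up values).
old = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
table = extend_embeddings(old, n_new=2)
```

In a real model the same idea would apply to the framework's embedding weights; only the appended rows would need to be learned from scratch.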