by cdrini 377 days ago

That's a very good point! I hadn't thought of that. And that makes sense, since the encoding of the word "sun" arises from its linguistic context, and there's no such shared context between the English word sun and any lion word in this imaginary multilingual corpus, so I don't think they'd go to the same point.

Apparently one thing you could do is train a word2vec model on each corpus and then align the two embedding spaces based on the distances between points. Doing this without a bilingual dictionary is called "unsupervised" alignment, and there's a tool from Facebook called MUSE that does it. (TIL, thanks ChatGPT!) https://github.com/facebookresearch/MUSE?tab=readme-ov-file
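To make the "align them" part concrete, here is a rough sketch of just the rotation step, using orthogonal Procrustes in NumPy. This is not MUSE's code or API. It assumes you already have matching anchor pairs (row i of X corresponds to row i of Y), which is the supervised case; the unsupervised setting has to discover those pairs first. The toy data, the hidden rotation Q, and the function name procrustes_align are all made up for illustration.

```python
import numpy as np

def procrustes_align(X, Y):
    """Return the orthogonal matrix W that minimizes ||X @ W - Y||_F.

    Assumes row i of X and row i of Y are embeddings of the same concept.
    """
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

rng = np.random.default_rng(0)

# Toy "language A" embeddings: 50 words, 8 dimensions (hypothetical data).
X = rng.normal(size=(50, 8))

# Toy "language B": the same geometry, rotated by a hidden orthogonal matrix.
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))
Y = X @ Q

W = procrustes_align(X, Y)
aligned = X @ W

# For each aligned A-vector, find the index of the closest B-vector.
dists = np.linalg.norm(aligned[:, None, :] - Y[None, :, :], axis=-1)
nearest = dists.argmin(axis=1)
```

In this idealized setup the two spaces differ only by a rotation, so W recovers Q and every word maps back to its counterpart. Real corpora would only be approximately alignable, which is where the hard part of the problem lives.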

Although I wonder if there are better embedding approaches now as well. Word2Vec is what I played around with a few years ago; I'm sure it's ancient now!

Edit: that's what I get for posting before finishing the article! The whole point of their research is to try to build such a mapping: vec2vec!