
by cdrini 377 days ago

Hmm I don't think we'd need a rosetta stone. In the same way LLMs associate via purely contextual usage the meaning of words, two separate data sets of lion and English, encoded into the same vector space, might pick up patterns of contextual usage at a high enough level to allow for mapping between the two languages.

For example, given thousands of English sentences with the word "sun", the vector embedding encodes the meaning. Assuming the lion word for "sun" is used in much the same context (near lion words for "hot", "heat", etc), it would likely end up in a similar spot near the English word for sun. And because of our shared context of living on Earth and being animals, I reckon many words will likely be used in similar contexts.
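A minimal sketch of the single-language half of that idea, assuming a tiny made-up corpus and plain co-occurrence counts rather than a trained model like word2vec. The corpus, the window size, and the helper names are all illustrative assumptions. It only shows that words used in similar contexts get similar vectors; it says nothing yet about mapping between two languages.

```python
import numpy as np

# Tiny made-up corpus (an assumption for illustration only).
corpus = [
    "the sun is hot today",
    "the sun gives heat and light",
    "the fire is hot and bright",
    "the fire gives heat and light",
    "the ice is cold today",
    "the snow is cold and white",
]

def cooccurrence_vectors(sentences, window=2):
    """Return (vocab, counts); row i counts the words seen within `window` of vocab[i]."""
    tokens = [s.split() for s in sentences]
    vocab = sorted({w for sent in tokens for w in sent})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)))
    for sent in tokens:
        for i, w in enumerate(sent):
            lo, hi = max(0, i - window), min(len(sent), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[sent[j]]] += 1
    return vocab, counts

def cosine(a, b):
    """Cosine similarity between two count vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

vocab, counts = cooccurrence_vectors(corpus)
vec = {w: counts[i] for i, w in enumerate(vocab)}

# "sun" and "fire" share contexts (hot, heat), "ice" does not.
print("sun~fire:", cosine(vec["sun"], vec["fire"]))
print("sun~ice: ", cosine(vec["sun"], vec["ice"]))
```

In this toy, "sun" scores closer to "fire" than to "ice" purely because of the neighboring words, which is the contextual-usage effect described above.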

That's my guess though, note I don't know a ton about the internals of LLMs.

1 comments

zos_kia 377 days ago

Someone more knowledgeable might chime in, but I don't think two corpuses can be mapped to the same vector space. Wouldn't each vector space be derived from its corpus?


godelski 377 days ago

It depends how you define the vector space but I'm inclined to agree.

The reason I think this is from evidence in human language. Spend time with any translator and they'll tell you that some things just don't really translate. The main concepts might, but there's subtleties and nuances that really change the feel. You probably notice this with friends who have a different native language than you.

Even same-language communication is noisy. You even misunderstand your friends and partners, right? The people who have the greatest chance of understanding you. It's because the words you say don't convey all the things in your head. It's heavily compressed. Then the listener has to decompress from those lossy words. I mean you can go to any Internet forum and see this in action: there's more than one way to interpret anything. Seems most internet fights start this way. So it's good to remember that there isn't an objective communication. We improperly encode as well as improperly decode. It's on us to try to find out what the speaker means, which may be very different from the words they say (take any story or song to see the more extreme versions of this; this feature is heavily used in art).

Really, that comes down to the idea of universal language[0]. I'm not a linguist (I'm an AI researcher), but my understanding is most people don't believe it exists and I buy the arguments. Hard to decouple due to shared origins and experiences.

[0] https://en.wikipedia.org/wiki/Universal_language


cdrini 377 days ago

Hmm I don't think a universal language is implied by being able to translate without a rosetta stone. I agree, I don't think there is such a thing as a universal language, per se, but I do wonder if there is a notion of a universal language at a certain level of abstraction.

But I think those ambiguous cases can still be understood/defined. You can describe how this one word in lion doesn't neatly map to a single word in English, and is used in a few different ways. Some of which we might not have a word for in English, in which case we would likely adopt the lion word.

Although note I do think I was wrong about embedding a multilingual corpus into a single space. The example I was thinking of was word2vec, and that appears to only work with one language. I did find some papers showing that you can align the two spaces without supervision, but I don't know how successful that is, or how that would treat these ambiguous cases.


godelski 377 days ago

  > I don't think a universal language is implied by being able to translate without a rosetta stone.

Depends what you mean. If you want a 1-to-1 translation then your languages need to be isomorphic. For lossy translation you still need some intersection within the embedding space. The intersection will determine how well you can translate. It isn't unreasonable to assume that there are some universal traits here as any being lives in this universe and we're all subject to these experiences at some level, right? But that could result in some very lossy translations that are effectively impossible to translate, right?

Another way you can think about it, though, is that language might not be dependent on experience. If it is completely divorced, we may be able to understand anyone regardless of experience. If it is mixed, then results can be mixed.

  > The example I was thinking of was word2vec

Be careful with this. If you haven't actually gone deep into the math (more than 3Blue1Brown) you'll find some serious limitations to this. Play around with it and you'll experience these too. Distances in high dimensions are not well defined. There also aren't smooth embeddings here. You have a lot of similar problems to embedding methods like t-SNE. Certainly has uses but it is far too easy to draw the wrong conclusions from them. Unfortunately, both of these are often spoken about incorrectly (think as incorrect as most people's understanding of things like Schrodinger's Cat or the Double Slit experiment, or really most of QM. There's some elements of truth but it's communicated through a game of telephone).
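One way to read the point about distances is the usual "distance concentration" effect: as dimension grows, the nearest and farthest points from a query end up at nearly the same distance. A small sketch, assuming random Gaussian points rather than real word vectors (the function name and parameters are made up for illustration):

```python
import numpy as np

def relative_contrast(dim, n_points=500, seed=0):
    """(farthest - nearest) / nearest distance from one random query to random points."""
    rng = np.random.default_rng(seed)
    points = rng.standard_normal((n_points, dim))
    query = rng.standard_normal(dim)
    dists = np.linalg.norm(points - query, axis=1)
    return float((dists.max() - dists.min()) / dists.min())

# The contrast between "near" and "far" shrinks as the dimension grows.
for dim in (2, 10, 100, 1000):
    print(dim, round(relative_contrast(dim), 3))
```

Real embeddings are not Gaussian noise, so this overstates the problem for trained vectors, but it shows why "nearest neighbor" needs more care in high dimensions than the 2D pictures suggest.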


cdrini 377 days ago

That's a very good point! I hadn't thought of that. And that makes sense, since the encoding of the word "sun" arises from its linguistic context, and there's no such shared context between the English word sun and any lion word in this imaginary multilingual corpus, so I don't think they'd go to the same point.

Apparently one thing you could do is train a word2vec on each corpus and then align them based on proximity/distances. Apparently this is called "unsupervised" alignment and there's a tool by Facebook called MUSE to do it. (TIL, Thanks ChatGPT!) https://github.com/facebookresearch/MUSE?tab=readme-ov-file
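A toy sketch of what "align them based on proximity/distances" could mean. This is not MUSE's API or method (as far as I can tell from its README, MUSE uses adversarial training followed by Procrustes refinement). It assumes an idealized case where the "lion" space is an exact rotation and reshuffling of the "English" space, which real embeddings trained on different corpora would not be. All names here are made up for illustration.

```python
import numpy as np

def distance_signatures(emb):
    """Sorted distances from each point to all others (unchanged by rotation or row order)."""
    diffs = emb[:, None, :] - emb[None, :, :]
    return np.sort(np.linalg.norm(diffs, axis=2), axis=1)

def match_by_signature(src, tgt):
    """For each source row, pick the target row with the most similar distance signature."""
    sig_src, sig_tgt = distance_signatures(src), distance_signatures(tgt)
    cost = np.linalg.norm(sig_src[:, None, :] - sig_tgt[None, :, :], axis=2)
    return cost.argmin(axis=1)

def procrustes_rotation(src, tgt):
    """Orthogonal W minimizing ||src @ W - tgt|| (orthogonal Procrustes via SVD)."""
    u, _, vt = np.linalg.svd(src.T @ tgt)
    return u @ vt

rng = np.random.default_rng(0)
n_words, dim = 30, 5

# Pretend "English" embeddings, and a "lion" space that is the same geometry
# rotated and with its rows shuffled (the idealized assumption).
english = rng.standard_normal((n_words, dim))
rotation, _ = np.linalg.qr(rng.standard_normal((dim, dim)))
shuffle = rng.permutation(n_words)
lion = english[shuffle] @ rotation  # lion[k] corresponds to english[shuffle[k]]

# Step 1: guess word pairs without any dictionary, using only distance patterns.
guess = match_by_signature(english, lion)
# Step 2: fit a rotation on the guessed pairs.
W = procrustes_rotation(english, lion[guess])

true_pairing = np.argsort(shuffle)  # inverse permutation: english[i] -> lion row
aligned_error = float(np.linalg.norm(english @ W - lion[guess]))
print("pairs recovered:", int((guess == true_pairing).sum()), "of", n_words)
print("alignment error:", aligned_error)
```

With noisy, only roughly similar spaces, the exact signature matching in step 1 would break down, which is presumably why the real tools need something more robust for the initial guess.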

Although I wonder if there are better embedding approaches now as well. Word2Vec is what I've played around with from a few years ago, I'm sure it's ancient now!

Edit: that's what I get for posting before finishing the article! The whole point of their research is to try to build such a mapping, vec2vec!
