| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by KRAKRISMOTT 1250 days ago
	Assuming all human languages have a common shared semantic meaning in latent space (I am flipping cause and effect here, but our purposes it doesn't really matter), and assuming that human languages largely follow the same pattern (this assumption is based on the fact that we can trace the roots of modern languages back to the Phoenician script), it is reasonable to assume that we can fine-tune a self supervised model on a tiny amount of data. (The emergent properties of a LLM is carrying a lot of weight here, many of the assumptions rely on the fact that LLM's emergent properties arise from the idea that the latent structure of various languages is learnt by the model)

4 comments

Roark66 1250 days ago

I think you may be on to something here. For example ChatGPT is perfectly capable of "understanding" and speaking Polish while the amount of training data in this language definitely wasn't a lot. It is not as eloquent as in English, but still for a model that has not been trained for translation tasks, this is very cool.

link

nannal 1250 days ago

Its Lithuanian is awful, I'd expect that any language further removed from that which the majority of it's training is in would be worse without a significant punt of data in that language. Its possible having that could affect it's English speaking capability, but that's just speculation on my part.

link

trekkie1024 1250 days ago

So you're saying something like the Universal Translator from Star Trek might be possible?

link

KRAKRISMOTT 1250 days ago

For humans yes, I am not saying one-shot learning would be possible for undocumented indigenous languages but few shot language acquisition in cases of a single surviving speaker is something that I would consider highly probable. This hypothesis relies heavily on the nature of variational learning in latent space and observations about human languages. It is of course possible that some ethnicity would have a language that's so different from other languages that it is effectively alien (and the assumption homo sapiens common brain structure and physiology have no influence on our languages and/or the human neural structure cannot be statistically modeled by latent variables, at least not with the current variational learning techniques). This is possible but very, very unlikely.

link

somenameforme 1250 days ago

In encryption it's generally impossible to decrypt a 1 to many hash. You can do some clever things (correlating and combining other data) but if you're just looking at some hash that could be an infinite number of other things, you're just out of luck.

I'll take the extreme position that language translation is an unsolvable problem because of this exact phenomena. There was a recent case where a politician was accused of making a racist remark. He said something like "You are a donkey." or "You all are donkeys." to another [minority background] politician. Which was it? Well in many languages the second person plural and the second person singular formal are identical. And there are no articles. So the two statements are literally identical. Which did he mean? Nobody will ever know, besides him.

And outside of inherent language ambiguities, start piling on the endless (and ever/rapidly changing) euphemisms, idioms, colloquialisms, metaphors, just plain old ambiguous sarcasm, and all the other things that make language fun (and more expressive). And these sort of things aren't really the exceptions so much as the rule. And it only becomes more common the more distant languages get. Translations from various Asian languages to English often look just hilarious. Now imagine going back to languages exponentially more detached from any modern language, using one can only imagine what sort of expressions, and trying to convert it.

Especially using a neural network type system you'll probably be able to get something. And, even worse, it might well even make sense. That's a problem because, kind of like ChatGPT, it being coherent is zero indication of it being right.

link

KRAKRISMOTT 1250 days ago

I don’t think the pigeonhole principle applies to human languages though

link

defrost 1250 days ago

> we can trace the roots of modern languages back to the Phoenician script

That's modern European languages ... and post ~1100 BCE if I recall correctly.

So, indigenous languages from people settled in Australia [1] for 50,000+ years can be a little different, some don't have "left" | "right" as relative to PoV directions and stick with East V. West as absolutes for example.

We're down to maybe 20-30 from pre colonial 100's though [2].

[1] https://mgnsw.org.au/wp-content/uploads/2019/01/map_col_high...

[2] https://www.youtube.com/watch?v=ZwSXFvDDjYE

link

astrange 1250 days ago

> That's modern European languages ... and post ~1100 BCE if I recall correctly.

That's Indo-European (as in descended from PIE/Yanmaya). Basque is older than all of them and it's a living European language.

link

wongarsu 1250 days ago

And we have about a million cuniform tablets, with maybe 10-100 words each, so we have a couple million words of text to fine tune the model with. Or maybe to use when training the next model, after all GPT can already speak multiple languages.

link

wongarsu 1250 days ago

Addendum: One possible challenge is that so far large lanuage models are trained on a large sample of all text that has been published, while what we have of cuniform is a decent sample of all text that has been written. Meaning most cuniform tablets are inventories, invoices, requests for payment, contracts, tablets from students practicing writing etc. Types of documents that are underrepresented in traditional training data.

link

cedilla 1250 days ago

Phoenician script is the common ancestor of Latin, greek and Cyrillic script.

You're probably thinking of the indo-european language family, which accounts for about 45% of native language speakers. The largest language family in the world, but not even a majority.

Scripts and languages change over the course of decades, and while there are well known mechanisms to those changes, trying to deduce hieroglyphics or ancient Egyptian from a modern corpus is impossible.

The idea that there is some shared structure in all language is known as universal grammar. If that structure exists is still hotly debated.

link

KRAKRISMOTT 1250 days ago

I am not saying that all languages have a shared structure, but from the Bayesian variational learning perspective, as long as the new data shares some structure with what the model has previously encountered, the prior training data contributes to understanding the new information i.e. few-shot. This is in the information theoretic sense, I am not stating any theories about the underlying semantics or grammar.

I know this may be not be the answer that you are looking for, but the way most of these ML systems are designed is based on the idea that life, the universe, and everything can be modeled by a series of joint probabilities. For toy problems you draw a diagram

https://en.m.wikipedia.org/wiki/Graphical_model

It's an old idea in AI (predates even ML) but people have never been able to do anything useful with it outside of exam problems until the emergence of language models on modern deep learning hardware. All of a sudden variational learning and causal inference are not merely statistical word problems for grad students any more. This is the key to how most of the custom deep learning based avatar generators work. They use a Variational Autoencoder. For LLMs, it is in the form of a transformer which contains a sampling step (sampling from a distribution is the key to Bayesian methods).

I would like to emphasize the theory of probabilistic learning is very different from the actual practice. The theory we have today isn't much different from 20 years ago. Implement the methods in for example Murphy's Probabilistic ML book and they would be useless if you don't have access to modern deep learning hardware and gradient descent optimizers. Without deep learning, we won't have LLMs, regardless of how fancy the variational learning theories are.

link

cedilla 1250 days ago

Is there any reason to assume any shared structure for unrelated languages though? Written language is just an encoding for information.

There is a good candidate for a test. Someone will probably already work on it. Minoan as written in Linear A has only survived in a few thousand tokens and despite thousands of man years of effort, natural intelligence has made virtually no progress in understanding it. That's still easy mode, since we know that the Minoans were in contact with speakers of indo-european and Semitic languages, and writers of hieroglyphics and phonetician script, so their written Language was probably influenced by that.

link

KRAKRISMOTT 1250 days ago

There is no reason to, as stated before, it is however a necessary assumption. It is also possible that the assumption is entirely wrong, and the LLM generates a plausible explanation to their language that we cannot falsify. If the shared structure hypothesis is incorrect, then it is no different from dealing with an alien language. (Note we can also feed in related information like where it was found, what the nearby pottery shards at the excavation site are etc. I am lumping all of these under the "shared structure" banner of the LLM's model of humanity/human languages)

link