Semi offtopic, but for some time I have been dreaming of training chatbot to communicate in cuneiform or hieroglyphs to bring some old languages back alive. Could it be possible, using old tablets as training data?
That's basically the problem of unsupervised machine translation using mainly monolingual corpora. It means giving a machine learning model tons of text in two languages and let it figure out how to do translation between some old language X and e.g. english. There's no need to feed it a parallel corpora, i.e. examples of sentences in X languages and their translations in english.
In some situations, this seemingly impossible task is doable and can yield good results. Researchers sometimes need to kickstart their models by giving them a mapping between words of the two languages (for english <-> french: "cat" <-> "chat", "book" <-> "livre" and so on). That's just simple vocabulary. While it's technically possible to learn this mapping from scratch, it's too difficult as for now.
Do you know of the Encoder-Decoder architecture? You feed something (image, text) to the encoder which compresses it to a very dense representation, and the decoder try to use the resulting dense vector to do useful stuff with it. The input could a sentence in english, the encoder then encodes it and the decoder tries to use the output of the encoder to generate the same sentence but in french. These architectures are useful because directly working with "plaintext" to learn how to do translation is way too expensive. I mean, that's one of the reasons.
What the encoder does is mapping a "sparse" representation of a sentence (plaintext) to a dense representation in a well-structured space (think of word2vec which managed to find that "king" + "woman" = "queen"). This space is called the "latent space". Some say it extracts the "meaning" of the sentence. To be more precise, it learns to extract enough information from the input and present it to the decoder in such a way that the decoder becomes able to solve a given task (machine translation, text summarizing etc).
One of the main assumption of the unsupervised models using monolingual data only is that both languages can be mapped to the same latent space. In other words, we assume that every sentences/texts in english has its exact french (or whatever) equivalent, that the resulting translated sentences contain exactly the same information/meaning as the original ones.
That's quite the dubious assumption. There's obviously some ideas, some stuff that can be expressed in some languages but can't be exactly expressed in some others. While theoretically unsound, however, these models were able to achieve pretty damn good results in the last couple of years.
I think we need a generic ai before we're able to do that as the data set is small and you would need to infer the rules.
A human is able to learn rules way more efficiently than ChatGPT.
Assuming all human languages have a common shared semantic meaning in latent space (I am flipping cause and effect here, but our purposes it doesn't really matter), and assuming that human languages largely follow the same pattern (this assumption is based on the fact that we can trace the roots of modern languages back to the Phoenician script), it is reasonable to assume that we can fine-tune a self supervised model on a tiny amount of data. (The emergent properties of a LLM is carrying a lot of weight here, many of the assumptions rely on the fact that LLM's emergent properties arise from the idea that the latent structure of various languages is learnt by the model)
I think you may be on to something here. For example ChatGPT is perfectly capable of "understanding" and speaking Polish while the amount of training data in this language definitely wasn't a lot. It is not as eloquent as in English, but still for a model that has not been trained for translation tasks, this is very cool.
Its Lithuanian is awful, I'd expect that any language further removed from that which the majority of it's training is in would be worse without a significant punt of data in that language. Its possible having that could affect it's English speaking capability, but that's just speculation on my part.
For humans yes, I am not saying one-shot learning would be possible for undocumented indigenous languages but few shot language acquisition in cases of a single surviving speaker is something that I would consider highly probable. This hypothesis relies heavily on the nature of variational learning in latent space and observations about human languages. It is of course possible that some ethnicity would have a language that's so different from other languages that it is effectively alien (and the assumption homo sapiens common brain structure and physiology have no influence on our languages and/or the human neural structure cannot be statistically modeled by latent variables, at least not with the current variational learning techniques). This is possible but very, very unlikely.
In encryption it's generally impossible to decrypt a 1 to many hash. You can do some clever things (correlating and combining other data) but if you're just looking at some hash that could be an infinite number of other things, you're just out of luck.
I'll take the extreme position that language translation is an unsolvable problem because of this exact phenomena. There was a recent case where a politician was accused of making a racist remark. He said something like "You are a donkey." or "You all are donkeys." to another [minority background] politician. Which was it? Well in many languages the second person plural and the second person singular formal are identical. And there are no articles. So the two statements are literally identical. Which did he mean? Nobody will ever know, besides him.
And outside of inherent language ambiguities, start piling on the endless (and ever/rapidly changing) euphemisms, idioms, colloquialisms, metaphors, just plain old ambiguous sarcasm, and all the other things that make language fun (and more expressive). And these sort of things aren't really the exceptions so much as the rule. And it only becomes more common the more distant languages get. Translations from various Asian languages to English often look just hilarious. Now imagine going back to languages exponentially more detached from any modern language, using one can only imagine what sort of expressions, and trying to convert it.
Especially using a neural network type system you'll probably be able to get something. And, even worse, it might well even make sense. That's a problem because, kind of like ChatGPT, it being coherent is zero indication of it being right.
> we can trace the roots of modern languages back to the Phoenician script
That's modern European languages ... and post ~1100 BCE if I recall correctly.
So, indigenous languages from people settled in Australia [1] for 50,000+ years can be a little different, some don't have "left" | "right" as relative to PoV directions and stick with East V. West as absolutes for example.
We're down to maybe 20-30 from pre colonial 100's though [2].
And we have about a million cuniform tablets, with maybe 10-100 words each, so we have a couple million words of text to fine tune the model with. Or maybe to use when training the next model, after all GPT can already speak multiple languages.
Addendum: One possible challenge is that so far large lanuage models are trained on a large sample of all text that has been published, while what we have of cuniform is a decent sample of all text that has been written. Meaning most cuniform tablets are inventories, invoices, requests for payment, contracts, tablets from students practicing writing etc. Types of documents that are underrepresented in traditional training data.
Phoenician script is the common ancestor of Latin, greek and Cyrillic script.
You're probably thinking of the indo-european language family, which accounts for about 45% of native language speakers. The largest language family in the world, but not even a majority.
Scripts and languages change over the course of decades, and while there are well known mechanisms to those changes, trying to deduce hieroglyphics or ancient Egyptian from a modern corpus is impossible.
The idea that there is some shared structure in all language is known as universal grammar. If that structure exists is still hotly debated.
I am not saying that all languages have a shared structure, but from the Bayesian variational learning perspective, as long as the new data shares some structure with what the model has previously encountered, the prior training data contributes to understanding the new information i.e. few-shot. This is in the information theoretic sense, I am not stating any theories about the underlying semantics or grammar.
I know this may be not be the answer that you are looking for, but the way most of these ML systems are designed is based on the idea that life, the universe, and everything can be modeled by a series of joint probabilities. For toy problems you draw a diagram
It's an old idea in AI (predates even ML) but people have never been able to do anything useful with it outside of exam problems until the emergence of language models on modern deep learning hardware. All of a sudden variational learning and causal inference are not merely statistical word problems for grad students any more. This is the key to how most of the custom deep learning based avatar generators work. They use a Variational Autoencoder. For LLMs, it is in the form of a transformer which contains a sampling step (sampling from a distribution is the key to Bayesian methods).
I would like to emphasize the theory of probabilistic learning is very different from the actual practice. The theory we have today isn't much different from 20 years ago. Implement the methods in for example Murphy's Probabilistic ML book and they would be useless if you don't have access to modern deep learning hardware and gradient descent optimizers. Without deep learning, we won't have LLMs, regardless of how fancy the variational learning theories are.
Is there any reason to assume any shared structure for unrelated languages though? Written language is just an encoding for information.
There is a good candidate for a test. Someone will probably already work on it. Minoan as written in Linear A has only survived in a few thousand tokens and despite thousands of man years of effort, natural intelligence has made virtually no progress in understanding it. That's still easy mode, since we know that the Minoans were in contact with speakers of indo-european and Semitic languages, and writers of hieroglyphics and phonetician script, so their written Language was probably influenced by that.
There is no reason to, as stated before, it is however a necessary assumption. It is also possible that the assumption is entirely wrong, and the LLM generates a plausible explanation to their language that we cannot falsify. If the shared structure hypothesis is incorrect, then it is no different from dealing with an alien language. (Note we can also feed in related information like where it was found, what the nearby pottery shards at the excavation site are etc. I am lumping all of these under the "shared structure" banner of the LLM's model of humanity/human languages)
In some situations, this seemingly impossible task is doable and can yield good results. Researchers sometimes need to kickstart their models by giving them a mapping between words of the two languages (for english <-> french: "cat" <-> "chat", "book" <-> "livre" and so on). That's just simple vocabulary. While it's technically possible to learn this mapping from scratch, it's too difficult as for now.
Do you know of the Encoder-Decoder architecture? You feed something (image, text) to the encoder which compresses it to a very dense representation, and the decoder try to use the resulting dense vector to do useful stuff with it. The input could a sentence in english, the encoder then encodes it and the decoder tries to use the output of the encoder to generate the same sentence but in french. These architectures are useful because directly working with "plaintext" to learn how to do translation is way too expensive. I mean, that's one of the reasons.
What the encoder does is mapping a "sparse" representation of a sentence (plaintext) to a dense representation in a well-structured space (think of word2vec which managed to find that "king" + "woman" = "queen"). This space is called the "latent space". Some say it extracts the "meaning" of the sentence. To be more precise, it learns to extract enough information from the input and present it to the decoder in such a way that the decoder becomes able to solve a given task (machine translation, text summarizing etc).
One of the main assumption of the unsupervised models using monolingual data only is that both languages can be mapped to the same latent space. In other words, we assume that every sentences/texts in english has its exact french (or whatever) equivalent, that the resulting translated sentences contain exactly the same information/meaning as the original ones.
That's quite the dubious assumption. There's obviously some ideas, some stuff that can be expressed in some languages but can't be exactly expressed in some others. While theoretically unsound, however, these models were able to achieve pretty damn good results in the last couple of years.