Hacker News new | ask | show | jobs
by cedilla 1249 days ago
Phoenician script is the common ancestor of Latin, greek and Cyrillic script.

You're probably thinking of the indo-european language family, which accounts for about 45% of native language speakers. The largest language family in the world, but not even a majority.

Scripts and languages change over the course of decades, and while there are well known mechanisms to those changes, trying to deduce hieroglyphics or ancient Egyptian from a modern corpus is impossible.

The idea that there is some shared structure in all language is known as universal grammar. If that structure exists is still hotly debated.

1 comments

I am not saying that all languages have a shared structure, but from the Bayesian variational learning perspective, as long as the new data shares some structure with what the model has previously encountered, the prior training data contributes to understanding the new information i.e. few-shot. This is in the information theoretic sense, I am not stating any theories about the underlying semantics or grammar.

I know this may be not be the answer that you are looking for, but the way most of these ML systems are designed is based on the idea that life, the universe, and everything can be modeled by a series of joint probabilities. For toy problems you draw a diagram

https://en.m.wikipedia.org/wiki/Graphical_model

It's an old idea in AI (predates even ML) but people have never been able to do anything useful with it outside of exam problems until the emergence of language models on modern deep learning hardware. All of a sudden variational learning and causal inference are not merely statistical word problems for grad students any more. This is the key to how most of the custom deep learning based avatar generators work. They use a Variational Autoencoder. For LLMs, it is in the form of a transformer which contains a sampling step (sampling from a distribution is the key to Bayesian methods).

I would like to emphasize the theory of probabilistic learning is very different from the actual practice. The theory we have today isn't much different from 20 years ago. Implement the methods in for example Murphy's Probabilistic ML book and they would be useless if you don't have access to modern deep learning hardware and gradient descent optimizers. Without deep learning, we won't have LLMs, regardless of how fancy the variational learning theories are.

Is there any reason to assume any shared structure for unrelated languages though? Written language is just an encoding for information.

There is a good candidate for a test. Someone will probably already work on it. Minoan as written in Linear A has only survived in a few thousand tokens and despite thousands of man years of effort, natural intelligence has made virtually no progress in understanding it. That's still easy mode, since we know that the Minoans were in contact with speakers of indo-european and Semitic languages, and writers of hieroglyphics and phonetician script, so their written Language was probably influenced by that.

There is no reason to, as stated before, it is however a necessary assumption. It is also possible that the assumption is entirely wrong, and the LLM generates a plausible explanation to their language that we cannot falsify. If the shared structure hypothesis is incorrect, then it is no different from dealing with an alien language. (Note we can also feed in related information like where it was found, what the nearby pottery shards at the excavation site are etc. I am lumping all of these under the "shared structure" banner of the LLM's model of humanity/human languages)