| >> Here’s a very basic example of where an LLM is clearly more capable than a human: language translation. I would bet $10k at 10:1 that there are no humans who can reliably translate to and from as many languages as an LLM can. See, translation is exactly the kind of domain where there are no good measures of performance and where performance is open to subjective interpretation, and a lot of it. That's because we don't know what a "good translation" is and, crucially, machine translation systems and language models have not helped us find out. Machine translation systems are generally evaluated with a metric based on similarity to an arbitrarily chosen "gold standard" translation. What that means in practice is that we have some corpus of parallel texts, we train a machine translation system on part of the corpus, and then we test it on the held-out test set. To test, we take each unit (e.g. each sentence) of the text translated by the system and compare it, as a bag of words or a set of n-grams, to the corresponding text in the reference translation. If there is a high amount of overlap, the system scores highly. That's the way BLEU scores work, and similar metrics like ROUGE. It is important to note how arbitrary this metric is: out of all possible translations we choose one to be the "reference" translation and compare machine translations to it. The only accepted alternative is eyeballing, where we give the machine translation to a bunch of humans and ask them how they feel about it. My point is that we don't know how to measure knowledge, and language models are trained to maximise similarity, not knowledge. So there's no way to go from observations of their behaviour to a measure of their knowledge. All you can say about a language model is that it is good, or bad, at generating text that's similar to its training corpus. Everything else is an assumption. |
Just give the participants my French 3 final, but in 100 different language combinations. I bet you do worse than ChatGPT.
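
For anyone following along, the overlap scoring described in the quoted comment can be sketched in a few lines. This is only a rough illustration, not real BLEU: it computes clipped n-gram precision for a single n against a single reference, with no brevity penalty and no averaging over n-gram orders. The function names and the example sentences are made up for the sketch.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(candidate, reference, n=1):
    """Fraction of candidate n-grams that also occur in the reference.

    Counts are clipped, so a repeated n-gram is credited at most as many
    times as it appears in the reference. Tokenisation is naive
    whitespace splitting (an assumption of this sketch).
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical system output and two equally plausible human references.
system_output = "the cat sat on the mat"
reference_a = "the cat sat on the mat"
reference_b = "a cat was sitting on the rug"

score_a = ngram_precision(system_output, reference_a)  # 1.0
score_b = ngram_precision(system_output, reference_b)  # 0.5
```

The same output scores 1.0 or 0.5 depending only on which human translation was picked as the reference, which is the arbitrariness the parent comment is pointing at.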