| HN Mirror


by YeGoblynQueenne 792 days ago

>> Here’s a very basic example of where an LLM is clearly more capable than a human: language translation. I would bet $10k at 10:1 that there are no humans who can reliably translate to and from as many languages as an LLM can.

See, translation is exactly the kind of domain where there are no good measures of performance and where performance is open to a lot of subjective interpretation. That's because we don't know what a "good translation" is and, crucially, machine translation systems and language models have not helped us find out.

Machine translation systems are generally evaluated with a metric based on similarity to an arbitrarily chosen "gold standard" translation. What that means in practice is that we have some corpus of parallel texts, we train a machine translation system on one part of the corpus, and then we test it on the held-out part. To test, we take each unit (e.g. each sentence) of the text translated by the system and compare it, as a bag of words or a set of n-grams, to the corresponding text in the reference translation. If there is a high amount of overlap, the system scores highly. That's how BLEU scores work, and similar metrics like ROUGE.
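To make the overlap idea concrete, here is a minimal sketch of clipped n-gram precision, the kind of count that BLEU-style metrics are built on. It is not full BLEU (no brevity penalty, no averaging over n-gram orders, whitespace tokenisation only), and the function names `ngrams` and `clipped_precision` are made up for this illustration.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Fraction of candidate n-grams that also occur in the reference.

    Each candidate n-gram is credited at most as many times as it
    appears in the reference ("clipping"), so repeating a matching
    word does not inflate the score.
    """
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matched / total


# Hypothetical system output and reference translation.
system_output = "the cat sat on the mat"
reference = "the cat is on the mat"

unigram_score = clipped_precision(system_output, reference, 1)  # 5 of 6 unigrams match
bigram_score = clipped_precision(system_output, reference, 2)   # 3 of 5 bigrams match
```

Note that nothing in the computation looks at meaning: the score is a count of shared token sequences, and that's all.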

It is important to note how arbitrary this metric is: out of all possible translations we choose one to be the "reference" translation and compare machine translations to it. The only accepted alternative is eyeballing, where we give the machine translation to a bunch of humans and ask them how they feel about it.
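A toy illustration of that arbitrariness, assuming a plain bag-of-words overlap as the metric. The sentences and the `unigram_overlap` helper are invented for the example; the same output is scored against two references that a human reader might accept as saying the same thing.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Share of candidate words that also appear in the reference (bag of words)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared word.
    return sum((cand & ref).values()) / total


# One hypothetical system output, two hypothetical reference translations.
candidate = "he passed away last night"
reference_a = "he passed away last night"
reference_b = "he died yesterday evening"

score_a = unigram_overlap(candidate, reference_a)  # every word matches
score_b = unigram_overlap(candidate, reference_b)  # only "he" matches

print(f"against reference A: {score_a:.2f}")
print(f"against reference B: {score_b:.2f}")
```

The output didn't change between the two runs, only the choice of reference did, and the score moves with it.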

My point is that we don't know how to measure knowledge, and language models are trained to maximise similarity, not knowledge. So there's no way to go from observations of their behaviour to a measure of their knowledge. All you can say about a language model is that it is good, or bad, at generating text that's similar to its training corpus. Everything else is an assumption.

1 comments

williamcotton 792 days ago

Good god, people, we measure knowledge all the time with testing. We have a difficult time measuring intelligence but we have no problem measuring someone’s knowledge about the major events that led up to the Battle of Waterloo.

Just give the participants the final exam from my French 3 class, but in 100 different language combinations. I bet you do worse than ChatGPT.

YeGoblynQueenne 792 days ago

>> Good god, people, we measure knowledge all the time with testing.

In humans. Not in machines.

You're proposing to use a test of human knowledge as a test of computer knowledge, when the question in the first place is whether a computer can have knowledge at all. It's like giving an IQ test to a frog and concluding that the frog has no IQ because it can't answer the questions, only reversed: the machine answers the questions, therefore it has knowledge. Who cares about mechanisms, who cares how the answers are generated, if I see answers, that's knowledge.

Well, that is a pre-scientific way to look at the world. I observe the sun, it looks like it's moving around the Earth, therefore the sun turns around the Earth. No room left for critical inquiry or understanding of the cause of phenomena. We have a test? Bash it against anything and we'll get some answers, and then we'll claim that they're the right answers because that's the right test, since it gave us the right answers. And all that, not for some mysterious physical phenomenon that we're not responsible for, but for a machine created and programmed by humans, whose workings we know exactly.

No no. That's not good engineering, and it's not good science: it doesn't explain the how, and it doesn't explain the why.