by pmoriarty 2432 days ago
There was an article[1] posted to HN recently about these benchmarks, and it was pretty skeptical.

Regarding SuperGLUE specifically, it asked:

"Indeed, Bowman and his collaborators recently introduced a test called SuperGLUE that's specifically designed to be hard for BERT-based systems. So far, no neural network can beat human performance on it. But even if (or when) it happens, does it mean that machines can really understand language any better than before? Or does it just mean that science has gotten better at teaching machines to the test?"

[1] - https://www.quantamagazine.org/machines-beat-humans-on-a-rea...

5 comments

This feels hollow. Can't this be said about any benchmark? It seems natural and proper that as one benchmark becomes saturated, we introduce harder benchmarks.

I don't think anyone in the field thinks that once we match human performance on benchmark X, we're officially done. It just means it's time for more interesting benchmarks.

Over time, if it starts to become difficult to design benchmarks that humans can outperform machines on, then that will prompt interesting conceptual work about what exactly the difference between human and machine language competency is. And then that will lead either to more sophisticated benchmarks or alternatively gradually more sophisticated and persuasive arguments that machines really have surpassed us in language competence.

I don't think we're yet at a point where we don't know how to make harder benchmarks, and if and when we do hit such a point, I'd definitely bet the result will be a conceptual advance in benchmark design rather than declaring machine superiority once and for all. At least for the first few rounds of this cycle.

From the Quanta article:

"But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing."

This is true, and absolutely a weakness of these tests.

However, they don't publish how well humans perform on the dataset once the examples containing "not" are removed.

They do initially note that "Even human beings don’t do particularly well on this task without practice".

I've looked at the warrant task. It's pretty tricky! I'd bet real money that on the examples without "not", untrained humans would score much, much lower than the 80% they get on the full set. I don't think it would be as low as the 53% BERT gets, but it would drop significantly.

I find the HANS analysis[1] much more compelling, but again I'd note that humans suffer on this dataset too (although again - not as badly as models do).

[1] https://www.aclweb.org/anthology/P19-1334.pdf
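
For anyone who wants to try the "not" check from the Quanta quote above on their own data, here is a rough sketch. It assumes each example is a pair of candidate warrants plus the index of the correct one; that layout, the function name, and the toy examples are all made up for illustration and are not the real dataset or the paper's code. It only scores the examples where exactly one warrant contains the cue word.

```python
def cue_heuristic_accuracy(examples, cue="not"):
    """Score the 'pick the warrant containing the cue word' heuristic.

    examples: iterable of (warrant_a, warrant_b, label), where label is
    0 if warrant_a is correct and 1 if warrant_b is correct.
    Returns (covered, accuracy): how many examples the cue separates,
    and the heuristic's accuracy on just those examples.
    """
    covered = 0
    correct = 0
    for warrant_a, warrant_b, label in examples:
        has_a = cue in warrant_a.lower().split()
        has_b = cue in warrant_b.lower().split()
        if has_a == has_b:
            # The cue appears in both or neither, so it gives no signal.
            continue
        covered += 1
        picked = 0 if has_a else 1
        if picked == label:
            correct += 1
    accuracy = correct / covered if covered else 0.0
    return covered, accuracy


# Invented toy examples, only to show the expected data shape.
toy_examples = [
    ("cats are not pets", "cats are pets", 0),
    ("rain is wet", "rain is not wet", 1),
    ("tea is not hot", "tea is hot", 1),
    ("birds fly", "fish swim", 0),
]

covered, accuracy = cue_heuristic_accuracy(toy_examples)
print(f"cue separates {covered} examples, heuristic accuracy {accuracy:.2f}")
```

If a heuristic like this scores well above chance on a two-choice task, a model can do the same without understanding the argument, which is the weakness the article describes.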

> Can't this be said about any benchmark?

Maybe it should be? The "dieselgate" talk[1] at 32c3 suggests engineering has gotten very good[2] at "teaching machines to the test".

[1] https://media.ccc.de/v/32c3-7331-the_exhaust_emissions_scand... (good text summary: https://lwn.net/Articles/670488/ )

[2] https://static.lwn.net/images/2016/vw-curves.png

Yes. This is more generally known as Goodhart's Law[0]: when a metric is used as a goal, then people will game the metric in order to win, making the metric useless.

There is no fundamental way to overcome this problem, except by not using metrics as goals.

[0] https://en.wikipedia.org/wiki/Goodhart's_law

You'd be absolutely right, if only this kind of event didn't so often trigger pop articles about how AI is now superhuman at XYZ.

Andrew Ng has a great summary on the purpose of human level performance: https://www.coursera.org/lecture/machine-learning-projects/w...

Even when you can have a 100% coherent and deep discussion with an AI about a niche technical domain, there will be people who claim that the AI "fakes" it.

Systems like GPT-2, incredibly (I used to be a skeptic of a pure statistical approach) manage to extract meaning, keep a theme, and understand the intent behind a sentence. They are amazing.

When you have a system that displays all the characteristics of understanding something, it is irrelevant whether or not it "fakes" it. No one ever proved that humans are not "faking" intelligence either.

As long as they're not training on the test data, and they're not submitting hundreds of submissions tweaking parameters trying to improve their score, I don't see what the problem is. If the algorithm can do a great job at classifying hundreds of new test cases it has never seen, and it isn't over-fitted, then that means it is good at that specific task. Of course the task itself may or may not be useful, and you can have some meta discussion about what "understanding language" is, but the computer definitely is doing a superhuman job at that given task.

Maybe it's over-fitted on the new data. There has to be a constant infusion of new training data and a system can only prove itself over time.

These rankings, if real, should be in constant flux.
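
To make the worry about hundreds of tuned submissions concrete, here is a small simulation sketch. Everything in it is an assumption for illustration (the function name, the sizes, the seed): every "submission" is a pure coin flip on a binary task, so its true accuracy is 50%, yet the best score against a fixed public test set looks clearly better than chance. Scoring the same coin-flip "model" on fresh examples brings it back to about 50%.

```python
import random


def simulate_leaderboard(num_submissions=500, public_size=200,
                         fresh_size=20000, seed=0):
    """Compare the best-of-many score on a fixed test set with fresh data.

    Every submission guesses at random, so none has real skill.
    Returns (best_public_accuracy, fresh_accuracy).
    """
    rng = random.Random(seed)
    public_labels = [rng.randint(0, 1) for _ in range(public_size)]

    # Keep whichever random submission happens to score best on the
    # fixed public test set, as repeated leaderboard tuning would.
    best_public = 0.0
    for _ in range(num_submissions):
        guesses = [rng.randint(0, 1) for _ in range(public_size)]
        hits = sum(g == y for g, y in zip(guesses, public_labels))
        best_public = max(best_public, hits / public_size)

    # The "winner" is still a coin flip, so evaluate a coin flip on
    # examples that were never used for selection.
    fresh_labels = [rng.randint(0, 1) for _ in range(fresh_size)]
    fresh_guesses = [rng.randint(0, 1) for _ in range(fresh_size)]
    fresh_hits = sum(g == y for g, y in zip(fresh_guesses, fresh_labels))
    return best_public, fresh_hits / fresh_size


best_public, fresh = simulate_leaderboard()
print(f"best public score: {best_public:.3f}, fresh-data score: {fresh:.3f}")
```

The gap between the two numbers comes purely from selecting on the test set, which is why a fixed benchmark with unlimited submissions can overstate progress.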

(I work in this field, although not specifically on benchmarking)

I think that this article makes a good point, and correctly identifies weaknesses.

However, I also think that humans often take very similar shortcuts. There are good reasons why "bag of words" approaches work much of the time. Additionally, there's lots of evidence showing that very rapid reading by humans does not imply deep understanding.

I think it's very important that people are aware of the weaknesses of these types of models. However, I think it's interesting that these weaknesses are becoming harder and harder to find.

The machines are always trained with the same dataset for each task. The biggest difference right now is small technical modifications to models that are also pre-trained on gigantic unlabelled datasets. This doesn't feel like we're teaching them to do the test specifically at all.