by gradys 2436 days ago
This feels hollow. Can't this be said about any benchmark? It seems natural and proper that as one benchmark becomes saturated, we introduce harder benchmarks.

I don't think anyone in the field thinks that once we match human performance on benchmark X, we're officially done. It just means it's time for more interesting benchmarks.

Over time, if it starts to become difficult to design benchmarks that humans can outperform machines on, then that will prompt interesting conceptual work about what exactly the difference between human and machine language competency is. And then that will lead either to more sophisticated benchmarks or alternatively gradually more sophisticated and persuasive arguments that machines really have surpassed us in language competence.

I don't think we're yet at a point where we don't know how to make harder benchmarks, and if and when we do hit such a point, I'd definitely bet the result will be a conceptual advance in benchmark design rather than declaring machine superiority once and for all. At least for the first few rounds of this cycle.

5 comments

From the Quanta article:

"But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing."

This is true, and absolutely a weakness of these tests.

However, they don't publish how well humans perform on the dataset once the examples containing "not" are removed.

They do initially note that "Even human beings don’t do particularly well on this task without practice."

I've looked at the warrant task. It's pretty tricky! I'd bet real money that untrained humans score much, much lower on the examples without "not" than the 80% correct rate they get on the full set. I don't think it would be as low as the 53% BERT gets, but it would drop significantly.

I find the HANS analysis[1] much more compelling, but I'd note that humans suffer on this dataset too (although, again, not as badly as models do).

[1] https://www.aclweb.org/anthology/P19-1334.pdf
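
For anyone who wants to see what a "spurious cue" baseline looks like in practice, here is a minimal sketch of the "just pick the warrant containing 'not'" heuristic from the Quanta quote. The data format, the function names, and the toy examples are all my own assumptions for illustration; this is not the paper's code or data, and the number it prints says nothing about the real dataset.

```python
from typing import List, Tuple

# Hypothetical format: (warrant_0, warrant_1, index_of_correct_warrant).
Example = Tuple[str, str, int]


def has_cue(text: str, cue: str = "not") -> bool:
    """Return True if the cue appears as a whole word (case-insensitive)."""
    return cue in text.lower().split()


def cue_only_predict(w0: str, w1: str, cue: str = "not") -> int:
    """Pick the warrant that contains the cue; fall back to 0 otherwise."""
    c0, c1 = has_cue(w0, cue), has_cue(w1, cue)
    if c1 and not c0:
        return 1
    return 0  # covers: only w0 has the cue, both have it, or neither does


def cue_only_accuracy(examples: List[Example], cue: str = "not") -> float:
    """Accuracy of a 'model' that never reads the argument, only the cue."""
    correct = sum(cue_only_predict(w0, w1, cue) == y for w0, w1, y in examples)
    return correct / len(examples)


# Made-up toy examples, NOT taken from the real task.
TOY = [
    ("Sugar is not healthy", "Sugar is healthy", 0),
    ("Rain is common here", "Rain is not common here", 1),
    ("Cats are not pets", "Cats are pets", 1),
    ("The law applies", "The law is unclear", 0),
]

if __name__ == "__main__":
    print(cue_only_accuracy(TOY))
```

If a baseline like this scores well above chance on a benchmark, that suggests the labels leak through surface wording, which is the failure the authors describe.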

> Can't this be said about any benchmark?

Maybe it should be? The "dieselgate" talk[1] at 32c3 suggests engineering has gotten very good[2] at "teaching machines to the test".

[1] https://media.ccc.de/v/32c3-7331-the_exhaust_emissions_scand... (good text summary: https://lwn.net/Articles/670488/ )

[2] https://static.lwn.net/images/2016/vw-curves.png

Yes. This is more generally known as Goodhart's Law[0]: when a metric is used as a goal, then people will game the metric in order to win, making the metric useless.

There is no fundamental way to overcome this problem, except by not using metrics as goals.

[0] https://en.wikipedia.org/wiki/Goodhart's_law

You'd be absolutely right, if only this kind of event didn't so often trigger pop articles about how AI is now superhuman at XYZ.

Andrew Ng has a great summary of the purpose of human-level performance: https://www.coursera.org/lecture/machine-learning-projects/w...