by gradys
2436 days ago
This feels hollow. Can't this be said about any benchmark? It seems natural and proper that as one benchmark becomes saturated, we introduce harder benchmarks. I don't think anyone in the field thinks that once we match human performance on benchmark X, we're officially done. It just means it's time for more interesting benchmarks. Over time, if it starts to become difficult to design benchmarks that humans can outperform machines on, then that will prompt interesting conceptual work about what exactly the difference between human and machine language competency is. And then that will lead either to more sophisticated benchmarks or alternatively gradually more sophisticated and persuasive arguments that machines really have surpassed us in language competence. I don't think we're yet at a point where we don't know how to make harder benchmarks, and if and when we do hit such a point, I'd definitely bet the result will be a conceptual advance in benchmark design rather than declaring machine superiority once and for all. At least for the first few rounds of this cycle.
"But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing."