by not2b 2428 days ago
From the Quanta article:

"But instead of concluding that BERT could apparently imbue neural networks with near-Aristotelian reasoning skills, they suspected a simpler explanation: that BERT was picking up on superficial patterns in the way the warrants were phrased. Indeed, after re-analyzing their training data, the authors found ample evidence of these so-called spurious cues. For example, simply choosing a warrant with the word “not” in it led to correct answers 61% of the time. After these patterns were scrubbed from the data, BERT’s score dropped from 77 to 53 — equivalent to random guessing."
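To make the "not" cue concrete, here is a minimal sketch of that kind of heuristic baseline: pick whichever of the two candidate warrants contains the word "not", and abstain when the cue doesn't separate them. The function names and the toy examples are made up for illustration; this is not the paper's code or data, and it won't reproduce the 61% figure.

```python
import re

def has_not(text):
    """True if the text contains the standalone word "not"."""
    return re.search(r"\bnot\b", text.lower()) is not None

def cue_choice(warrant0, warrant1):
    """Pick the warrant containing "not"; None if the cue doesn't discriminate."""
    a, b = has_not(warrant0), has_not(warrant1)
    if a == b:
        return None  # both or neither contain "not"
    return 0 if a else 1

def cue_stats(examples):
    """Return (coverage, accuracy on the examples where the cue applies)."""
    picks = [(cue_choice(w0, w1), label) for w0, w1, label in examples]
    covered = [(p, y) for p, y in picks if p is not None]
    coverage = len(covered) / len(examples) if examples else 0.0
    accuracy = sum(p == y for p, y in covered) / len(covered) if covered else 0.0
    return coverage, accuracy

# Hypothetical toy data: (warrant 0, warrant 1, index of the correct warrant).
TOY = [
    ("Cats are not social animals.", "Cats are social animals.", 0),
    ("Rain is not likely today.", "Rain is likely today.", 1),
    ("The bridge is safe.", "The bridge is unsafe.", 0),
    ("Taxes do not fund roads.", "Taxes fund roads.", 0),
]

coverage, accuracy = cue_stats(TOY)
print(f"cue applies to {coverage:.0%} of examples, accuracy there: {accuracy:.0%}")
```

If a baseline this simple scores well above chance on a real dataset, a model's high score on that dataset doesn't tell you much about reasoning.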


This is true, and absolutely a weakness of these tests.

However, they don't publish how well humans perform on the subset of the dataset without "not" in it.

They do initially note: "Even human beings don’t do particularly well on this task without practice."

I've looked at the warrant task. It's pretty tricky! I'd bet real money that on the examples without "not", untrained humans score much, much lower than the 80% they get on the full set. I don't think it would be as low as the 53% BERT gets, but it would drop significantly.

I find the HANS analysis[1] much more compelling, but I'd note that humans suffer on this dataset too (although, again, not as badly as models do).

[1] https://www.aclweb.org/anthology/P19-1334.pdf