
This is true, and absolutely a weakness of these tests.

However, they don't publish how well a human performs on the examples without "not" in them.

They do initially note that "Even human beings don’t do particularly well on this task without practice."

I've looked at the warrant task. It's pretty tricky! I'd bet real money that on the examples without "not", untrained humans would score much, much lower than the 80% they get on the full set. I don't think it would be as low as the 53% BERT gets, but it would drop significantly.

I find the HANS analysis[1] much more compelling, but I'd note that humans suffer on this dataset too (although, again, not as badly as models do).

[1] https://www.aclweb.org/anthology/P19-1334.pdf