Why would it matter? They wanted data generated for comparison and got it. They wanted a common type of human error and got it too. What's the criticism exactly?
Yes, they got some data and some errors. But they are using 25 samples to characterize "human reasoning". Sure, it's a place to start.
The bigger issue is that this pattern - small samples recruited via Mechanical Turk - is a frequent flier in papers whose claims end up failing further scrutiny. It's more common in sociology and psychology than in AI research, but I think we know this isn't a solid foundation for much extrapolation.