by 6gvONxR4sf7o 2435 days ago
One thing to always point out in these cases is that the human baseline isn't "how well people do at this task," like it's often hyped to be. It's "how well does a person quickly and repetitively doing this do, on average." The 'quickly and repetitively' part is important because we all make more boneheaded errors in this scenario. The 'on average' part is important because the errors the algo makes aren't just fewer than people's, they're different. The algos often still get certain things wrong that humans almost never would.

This is really, really great, let's be clear. It just doesn't live up to the "omg superhuman" hype it usually gets.

4 comments

It seems to mean "How well does Mechanical Turk do the task?" which is a separate thing again. And yes - error type is at least as revealing as error frequency.

I have no idea where the real human baseline is, or how to find it.

Also, consider this discussion. GLUE winners may be able to make informed parsing guesses about single text blocks, but they're years away from being able to make a useful contribution to a discussion like this one.

Regarding the type of errors, it seems like the benchmark should be able to take that into account. That is, get a load of humans to do the task on the same specific examples, then for each example you know how hard it is, and what acceptable answers are (I bet a lot of the ground truth is wrong or ambiguous).

Then you can benchmark your AI but penalise it more heavily for getting things wrong that are obvious to a human.
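
A rough sketch of what that scoring could look like, in Python. Everything here is an assumption for illustration: the function name `human_weighted_error`, the `min_share` cutoff for what counts as an "acceptable" answer, and using the share of annotators who picked the most common answer as the measure of how obvious an example is. It's one possible weighting, not how GLUE or any existing benchmark actually scores things.

```python
from collections import Counter


def human_weighted_error(human_answers, model_answers, min_share=0.3):
    """Error rate where mistakes on human-obvious examples cost more.

    human_answers: one list of annotator answers per example.
    model_answers: one model answer per example.
    min_share: hypothetical cutoff; any answer given by at least this
        share of annotators is treated as acceptable.
    """
    total_penalty = 0.0
    total_weight = 0.0
    for votes, prediction in zip(human_answers, model_answers):
        counts = Counter(votes)
        n = len(votes)
        # Answers that enough humans gave count as acceptable,
        # which tolerates ambiguous or mislabeled ground truth.
        acceptable = {a for a, c in counts.items() if c / n >= min_share}
        # "Obviousness": share of annotators who agree on the top answer.
        easiness = counts.most_common(1)[0][1] / n
        total_weight += easiness
        if prediction not in acceptable:
            total_penalty += easiness
    return total_penalty / total_weight if total_weight else 0.0


# Toy data: three examples, five annotators each.
human_answers = [
    ["pos", "pos", "pos", "pos", "pos"],  # obvious to everyone
    ["pos", "pos", "neg", "neg", "neg"],  # ambiguous, both answers accepted
    ["neg", "neg", "neg", "neg", "pos"],  # mostly clear
]
model_a = ["neg", "pos", "neg"]  # misses the obvious example
model_b = ["pos", "pos", "pos"]  # misses the less obvious example

error_a = human_weighted_error(human_answers, model_a)
error_b = human_weighted_error(human_answers, model_b)
```

Both toy models get exactly one example wrong, but `model_a` ends up with the higher error because its miss is on the example every annotator agreed on.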

That would be ideal, if money weren't a factor. Since money is a factor, I wonder what the tradeoff is between labelling each instance N more times versus just getting N times more instances labelled.

In the context of GPT-2, someone coined the expression "Humans Who Are Not Concentrating Are Not General Intelligences."

Great point! It makes sense in the context of what these algorithms would generally be tasked with.