by thepablohansen 931 days ago
Fascinating that the human benchmark is 63%. I wonder what the benchmark would look like had it been established, say, 30 years ago, before the prevalence of LLMs; I'd wager it would be very close to 100%. Speaks to the moving goalposts.
2 comments

I think the key thing to remember with the human benchmark is that it will change depending on interrogators' assumptions about human abilities.

When models are bad, humans are easy to spot. But if the model's pretty good, it's harder to be sure you're talking to a human.

On top of that, I think a lot of human users didn't want to get got by the model, so they had an a priori bias toward saying "AI".

This is almost surely just because it wasn't a real Turing Test. The paper discusses some limitations.