Hacker News
by notahacker 900 days ago
"GPT-4 achieved a success rate of 41 percent, second only to actual humans" also feels like a (much bigger) lie of omission looking at the original paper. GPT4's performance was in the range of 6% to 41%, and ELIZA's 27% score sat in the upper middle of that range. Considering the bots tested consisted of 8 GPT4 prompts, 2 GPT3.5 prompts and a naive script from the 1960s, GPT4 would have had to be remarkably consistently inhuman not to finish "second only to humans" with its highest-scoring prompt.

The blog appears to have been updated to specify GPT3.5, but the original version was accurate.

The paper itself is interesting as it covers the limitations (it has big methodological issues), how the GPT prompts attempted to override the default ChatGPT tone, and reasons why ELIZA performed surprisingly well (some thought it was so uncooperative, it must be human!). https://arxiv.org/pdf/2310.20216.pdf


The example ELIZA responses in the paper are so laughably bad and trivial to pick up, I'm not convinced the human interrogators were sober/conscious/awake during the experiment.
tbf the human side of those conversations isn't much better. I think if someone tried prompt injection hacks on me I'd be tempted to be politely obtuse to troll them right back.

Turing's version involves experts who definitely aren't in the same room waving to each other, but the fundamental problem is that it isn't a particularly good test.

Is there a name for the reverse Turing test? How can a Python script convince me it's not actually a human?
That's actually the critical part and much more relevant than the exact scores achieved.

According to the linked article, the main reason Eliza got such a pass is that the testers were looking for ChatGPT-esque giveaways: long-winded, frightened of controversy, prone to hallucinations.

Which Eliza is not. And (presumably) not being familiar with Eliza, they thought it was another bored human test subject putting in the absolutely bare minimum effort.

Eliza didn't pass for a human - it passed for not-an-LLM.

Yeah, it's really hard to get GPT to sound human because the RLHF really wants to let you know it's not a human.

GPT4 + an RLHF that was trained to think it was human would be a much different beast.

Yeah, GPT4 is not trained to beat the Turing test; it is trained to be an AI assistant.

Imagine you take a human and train them to be an AI assistant since they were a baby. I imagine their behaviour would also be very odd compared to average people.

Unfortunately, the paper only provides the text of one of the prompts they used (Juliet) and it happens to be one of the worst performing ones that scored lower than ELIZA. I suppose you could qualify the quote by saying that the best prompt used with GPT-4 had a 41 percent success rate. I don't think that's more of an omission than excluding the GPT model, excluding the prompt used, and ignoring the fact that other GPT models beat ELIZA.