by throwaway4aday 896 days ago
> Interestingly, Eliza still outperforms ChatGPT in certain Turing test variations.

I see we have a new entry for the 2024 Lies of Omission award.

The linked article plainly shows that Eliza only beats ChatGPT-3.5 and is in the bottom half when ranked against a variety of different system prompts. An excellent ass-covering strategy that relies on the reader not checking sources.

An honest author would have actually quoted the article saying:

> GPT-4 achieved a success rate of 41 percent, second only to actual humans.

instead of constructing a deliberately misleading paraphrase.


"GPT-4 achieved a success rate of 41 percent, second only to actual humans" also feels like a (much bigger) lie of omission looking at the original paper. GPT4's performance was in the range of 6% to 41%, Eliza's 27% score sat in the upper middle of that range, and considering the bots tested consisted of 8 GPT4 prompts, 2 GPT3.5 prompts and a naive script from the 1960s, GPT4 would have had to be remarkably consistently inhuman not to finish "second only to humans" with its highest scoring prompt

The blog appears to have been updated to specify GPT3.5, but the original version was accurate.

The paper itself is interesting as it covers the limitations (it has big methodological issues), how the GPT prompts attempted to override the default ChatGPT tone, and reasons why ELIZA performed surprisingly well (some thought it was so uncooperative, it must be human!) https://arxiv.org/pdf/2310.20216.pdf

The example ELIZA responses in the paper are so laughably bad and trivial to pick up, I'm not convinced the human interrogators were sober/conscious/awake during the experiment.
tbf the human side of those conversations isn't much better. I think if someone tried prompt injection hacks on me I'd be tempted to be politely obtuse to troll them right back.

Turing's version involves experts who definitely aren't in the same room waving to each other, but the fundamental problem is that it isn't a particularly good test.

Is there a name for the reverse Turing test? How can a Python script convince me it's not actually a human?
That's actually the critical part and much more relevant than the exact scores achieved.

According to the linked article, the main reason Eliza got such a pass is because the testers were looking for a ChatGPT-esque giveaway. Long-winded, frightened of controversy, prone to hallucinations.

Which Eliza is not. And (presumably) not being familiar with Eliza, they thought it was another bored human test subject putting in the absolutely bare minimum effort.

Eliza didn't pass for a human - it passed for not-a-LLM.

Yeah, it's really hard to get GPT to sound human because the RLHF really wants to let you know it's not a human.

GPT4 + a RLHF that was trained to think it was human would be a much different beast.

Yeah, GPT4 is not trained to beat the Turing test; it is trained to be an AI assistant.

Imagine you take a human and train them to be an AI assistant since they were a baby. I imagine their behaviour would also be very odd compared to average people.

Unfortunately, the paper only provides the text of one of the prompts they used (Juliet) and it happens to be one of the worst performing ones that scored lower than ELIZA. I suppose you could qualify the quote by saying that the best prompt used with GPT-4 had a 41 percent success rate. I don't think that's more of an omission than excluding the GPT model, excluding the prompt used, and ignoring the fact that other GPT models beat ELIZA.
Hum, note that this was not an argument about or against GPT, but about the "unreasonable" success of a, by all standards, primitive algorithm that manages to (somewhat) get away with it by crafting the pretext and context of the conversation. On the other hand, by no means could I read this article as claiming any superiority over modern applications.

(Nobody with even the crudest understanding of the principles of Eliza could claim this, and the article clearly demonstrates a detailed understanding. Disclaimer: I wrote the JS implementation linked in the article, many years ago.)

Edit: The question rightfully raised – and answered – by Eliza, which is still relevant today in the context of GPT, is: does the appearance of intelligent conversation (necessarily) hint at the presence of a world model in any rudimentary form?

Several people in this thread appear to have misunderstood due to the way this article was written.