The example ELIZA responses in the paper are so laughably bad and so trivial to spot that I'm not convinced the human interrogators were sober/conscious/awake during the experiment.
tbf the human side of those conversations isn't much better. I think if someone tried prompt injection hacks on me I'd be tempted to be politely obtuse to troll them right back.
Turing's version involves experts who definitely aren't in the same room waving to each other, but the fundamental problem is that it isn't a particularly good test.
That's actually the critical part and much more relevant than the exact scores achieved.
According to the linked article, the main reason Eliza got such a pass is that the testers were looking for a ChatGPT-esque giveaway: long-winded, frightened of controversy, prone to hallucinations.
Which Eliza is not. And (presumably) not being familiar with Eliza, they thought it was another bored human test subject putting in the absolutely bare minimum effort.
Eliza didn't pass for a human - it passed for not-an-LLM.