Just the other day there was a double-blind study that showed a 50-50 success rate in guessing whether you were interacting with a person or GPT. That's a Turing test pass, no?
Then it wasn't a canonical Turing test. The preprint accurately describes and analyzes their (indefensibly bad) experiment, but the popular press has mischaracterized it.
The canonical test gives the interrogator two witnesses, one human and one machine, and asks them to judge which witness is human. The interrogator knows that exactly one witness is human. In that test, a 50% chance of a right answer means the machine is indistinguishable from human. (Turing actually proposed a lower pass threshold, perhaps for statistical convenience.)
But that study gave the interrogator one witness and asked them to judge whether it was human. The interrogator wasn't told anything about the prior probability that their witness was human. The probabilities that a real human is judged human and that GPT-4 is judged human sum to more than 100%; nothing prevents that, because it's not a binary comparison between two witnesses. So 50% has no particular meaning. The result is effectively impossible to interpret, since it's a function both of the witness's performance and of whatever assumption the interrogator makes about the unspecified prior.
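To make the single-witness problem concrete, here is a toy signal-detection sketch. Everything in it is a hypothetical assumption, not data from the study: each conversation is assumed to leave the interrogator with a normally distributed "humanness" score with made-up means, and a single-witness interrogator answers "human" whenever the score clears a threshold, which stands in for their unstated assumption about the prior.

```python
import math


def phi(x: float) -> float:
    """Standard normal CDF, via math.erf."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


# HYPOTHETICAL model, not data from any study: each conversation leaves the
# interrogator with a "humanness" score ~ Normal(mean, 1). The means are made up.
HUMAN_MEAN = 1.0
MACHINE_MEAN = 0.5


def single_witness_rates(threshold: float) -> tuple[float, float]:
    """One witness; the interrogator says 'human' if the score exceeds threshold.

    The threshold stands in for the interrogator's unstated assumption about
    the prior. Returns (P(judged human | human), P(judged human | machine)).
    """
    p_human = 1.0 - phi(threshold - HUMAN_MEAN)
    p_machine = 1.0 - phi(threshold - MACHINE_MEAN)
    return p_human, p_machine


def paired_accuracy() -> float:
    """Two witnesses; the interrogator picks the higher-scoring one as human.

    The score difference is Normal(HUMAN_MEAN - MACHINE_MEAN, variance 2), so
    accuracy does not depend on any threshold. 50% would mean indistinguishable.
    """
    return phi((HUMAN_MEAN - MACHINE_MEAN) / math.sqrt(2.0))


for t in (-0.5, 0.5, 2.0):
    h, m = single_witness_rates(t)
    print(f"threshold {t:+.1f}: human judged human {h:.0%}, "
          f"machine judged human {m:.0%}, sum {h + m:.0%}")
print(f"paired test accuracy (no threshold involved): {paired_accuracy():.0%}")
```

With these made-up numbers, a threshold of 0.5 gets the machine judged human exactly half the time while the two "judged human" rates sum to well over 100%, yet the paired comparison still picks out the human about 64% of the time. Shifting the threshold moves the single-witness rates freely, which is why a bare 50% from that design says little on its own.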
In a 5-minute casual conversation.
Also, the statistics for humans and AI differed in some respect (like 48% vs 56% for some quantity); I don't recall the details.
Look, the Turing test is very different depending on the details, and I think a lame 5-minute Turing test that doesn't really measure anything of interest is a worse concept than a 1-day adversarial expert-team test that can detect AGI.
That linked study doesn't particularly resemble Turing's test, though? The authors asked an LLM some questions (like personality tests, or econ games), then reduced the responses to low-dimensional aggregates (like into "Big Five" personality traits), and compared those aggregates against human responses to the same questions. They found those aggregates to be indistinguishable, but that aggregation throws away almost all the information a typical human interrogator would use to judge.
Turing's interrogator also gets to ask whatever questions they think will most effectively distinguish human from machine. Everything those authors asked must appear in the training set countless times (and also corresponds closely to likely RLHF targets), making it a particularly unhelpful choice.
Turing was a WW2-era mathematician. He had no insight into or understanding of intelligence, made no study of intelligent systems, and so on (he believed in ESP, of all things).
Turing's test is a restatement of a behaviourism, common at the time, that is now pseudoscientific; and, egregiously, it makes a dumb ape the system which measures intelligence. If an ape can be fooled, the system is deemed intelligent. But people worshiped the sun and thought it conscious. People are desperate to analogise the world to themselves; it is a trivial thing to fool an ape on this matter.
Whatever one might make of this as a philosophical thought experiment, as a test for intelligence it's pseudoscience. What a person might or might not believe about a series of words sent across a wire isn't science, and it isn't relevant to a discussion about the capabilities of an AI system. It is a measure only of how easily deceived we are.
That has nothing to do with why Turing proposed it, nor does it have anything to do with general intelligence. This is just pseudoscience.
There's no scientific account of the capacities of a system with intelligence, no account of how these combine, no account of how communicative practices arise, etc. None. Any such attempt would immediately expose the "test" as ridiculous.
General intelligence arises as skillful adaptive control over one's environment, through sensory-motor concept acquisition, and so on.
It has absolutely nothing to do with whether you can emit text tokens in the right order to fool a user about whether the machine is a man or a woman (Turing's actual test). Nor does it have anything to do with whether you can fool a person at all.
A machine whose goal is to fool a user about its intelligence does not thereby acquire any capacities. Kinda obviously.
Turing's test not only displays a gross lack of concern for producing any capacities of intelligence in a system; as a research goal, it's actively hostile to the production of any such capacities. Since it is trivial to fool people, this requires no intelligence at all.
> General intelligence arises as skillful adaptive control over one's environment, through sensory-motor concept acquisition, and so on.
This isn't a generally accepted definition or process.
And indeed it seems to preclude people like Stephen Hawking, who had little control over his environment (or, to be pedantic, people who had similar conditions from birth).
For the purposes of my criticism of the Turing test, any discussion whatsoever about what capacities ground intelligence is already entertaining what Turing ruled out. He made the extremely pseudoscientific behaviourist assumption that no such science was required, that intelligent agents are just input-output relata on thin I/O boundaries.
Any even plausible scientific account of what capacities ground intelligence would render this view false. Whatever capacities you want to grant, no plausible ones are compatible with Turing's view or with the Turing test.
Consider imagination. You can replace a faculty to imagine with a set of models of ({prompt, reply},) histories for a human observer who is only concerned with those prompts and those replies. But as soon as anything changes in the world, you have to imagine novel things (e.g., SpaceX is founded, we visit Mars, a new TV show is released...). So questions such as "What would the latest SpaceGuys TV show be like if Elon had just launched BlahBlahRocket5?" cannot be given fitting answers. These require the actual faculty of imagination, along with being in the world and so on.
As soon as you make a sincere scientific attempt to characterise these features, you see immediately that whilst modelling historical frequencies of human-produced data can fool humans, it cannot impart these capacities.
Yeah, we don't have any science of intelligence; the only thing we have is empirical data, testing to see what works. That's why Turing tests are quite fundamental, imo.