by caddemon 1114 days ago
After playing the game they used (linked at top of article), I find it hard to draw much of a conclusion from this study. There is a quite short timer on not only the entire conversation, but on each response you can type. When the timer runs out, it sends your message in partially written form. That seriously stifles what you can ask the other "person", and it makes responses artificially short even to a deeper question. When the conversation is this stunted, of course it is harder to distinguish bot from human.

I'm also curious what study participants were told beforehand. If someone only had experience playing around with ChatGPT they might assume they should use a "detect GPT" strategy. Some of those strategies are pretty specific to the safety features that OpenAI implemented. But the LLM here will gladly curse at you or whatever. On the other hand I suspect it is less good than GPT - not that it matters so much when the entire conversation is exchanging single sentences.

I can't find it right now, but a chatbot that did quite well on Turing tests maybe 25-ish years ago was one that just took offense to whatever you said and started insulting you.

[edit]

Not sure if it was this one, but it is from over 30 years ago: https://humphryscomputing.com/Turing.Test/08.chapter.html

From the conclusion, a message that's applicable today:

"To date, AI has been held back, we argue, by the need for a single lab, even a single researcher, to fully understand the components of the system. As a result, only small minds have been built so far. The WWM argues that we must give up this dream of full understanding as we build more and more complex systems. And giving up this dream of full understanding is not a strange thing to do. It is what has always happened in other fields. It is how humanity has made its most complex things."

Now I am imagining a conversational AI exclusively trained on transcripts from Halo matches, scary.

That said, I have always felt that AI (and adjacent tech) has been lacking an appropriate amount of snark - when I take a wrong turn I feel like the GPS voice needs a bit more 'learn to drive dumb###' and a little less 're-routing'.

I once dated a Navteq employee who took particular offense at me missing turns while using a GPS unit that contained POI data she collected during the course of her job.

A simulacrum of that experience would probably be more amusing than the real thing.

Babylon 5 did a bit on this: https://www.youtube.com/watch?v=F_r7sh75258
> There is a quite short timer on not only the entire conversation

couldn't agree more and they took like 30 seconds to type a few words.

if i really have been talking to a human here, i can only suspect heavy usage of drugs: https://ibb.co/CHG2VcS

kinda seems like this is fake or maybe i am not aware that "elbows" are a thing you can be into now - maybe a trending new fetish?

I bet there are people purposely screwing with it, since they designed it to be two-sided when you get paired with a human. The actual Turing test is not supposed to be this way, though it still relies on at least some assumption of good-faith (or properly incentivized) participants.
> if i really have been talking to a human here, i can only suspect heavy usage of drugs

Do the human participants have any incentive to try to convince you of their humanness? My initial guess would be not that they are on drugs, but that they are messing with you.

"sexy elbows fetish" is an old meme
Additionally, I'd like to know how they corrected for this: "In a creative twist, many people pretended to be AI bots themselves in order to assess the response of their chat partners"

Assuming it actually was "many people", then whenever they have a human conversational partner (who also would be voting at the end), that person is going to have a hard time and skew the results.

Like imagine playing this game as a lay person after having used ChatGPT a little bit and then getting a response to your question that says "as a large language model ...". Depending on how well the game was explained to participants, it's possible that some people even did this intentionally to fuck with results.

In a proper Turing test there is supposed to be 1 bot and 2 humans, where one human is incentivized only to demonstrate they are human and the other human is the one asking probing questions and needing to guess which is which (but is already known to be human).

Anyway, I've only read the linked article and played the game a couple of times; I didn't look through the original research publication. It's certainly possible they did address some of these issues, but it is such a buzzword topic at the moment that I have my doubts. And regardless, the linked article should cover the limitations. For exactly this reason it is important that we have higher expectations for the quality of general audience writing about AI.

Ok I have to add one more thing that's funny since I just played a couple more times: if your conversational partner is a human and they exit the window mid-chat, it still lets you vote.
I had a bot send 1 message and then quit. I naturally assumed that bots wouldn't just quit, and 1 message isn't really enough to gauge anything, so yeah based on this experience I wouldn't draw much of a conclusion from it.
Interesting, I had a human quit, surprised bots might quit as well.
I don't know if they discarded any chats that ended before the timer, or without a sufficient number of messages, but that feels important for drawing conclusions.
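To make that concrete, here is a rough sketch of the kind of filter I mean. Everything in it is an assumption on my part: the `Chat` fields, the `min_messages` and `time_limit_s` thresholds, and the toy data are all made up for illustration, and I have no idea what the study's actual logs or cleaning steps look like.

```python
from dataclasses import dataclass


@dataclass
class Chat:
    # All fields are hypothetical; the study's real log format is unknown.
    messages: int          # total messages exchanged in the chat
    duration_s: float      # seconds elapsed when the chat ended
    partner_is_bot: bool   # ground truth about the partner
    guessed_bot: bool      # the participant's vote at the end


def usable(chat, min_messages=4, time_limit_s=120.0):
    """Keep only chats that ran the full (assumed) timer with enough messages."""
    return chat.messages >= min_messages and chat.duration_s >= time_limit_s


def accuracy(chats):
    """Fraction of chats where the vote matched the ground truth."""
    if not chats:
        return None
    correct = sum(c.guessed_bot == c.partner_is_bot for c in chats)
    return correct / len(chats)


# Made-up data: two chats were abandoned early, two ran to the end.
chats = [
    Chat(messages=1, duration_s=10.0, partner_is_bot=True, guessed_bot=False),
    Chat(messages=8, duration_s=120.0, partner_is_bot=True, guessed_bot=True),
    Chat(messages=6, duration_s=120.0, partner_is_bot=False, guessed_bot=False),
    Chat(messages=2, duration_s=35.0, partner_is_bot=False, guessed_bot=True),
]
kept = [c for c in chats if usable(c)]

print("all chats:", accuracy(chats))
print("complete chats only:", accuracy(kept))
```

With this toy data the headline accuracy changes a lot depending on whether the abandoned chats are counted, which is why I'd want to know which rule they used.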
I want to illustrate how easy it is to figure out the game: https://i.imgur.com/ezLVvo4.png

Took literally one message. You don't need much to totally wreck an AI, you just need to know the weak points.