by mike_hearn 889 days ago
No, it can yield a deliberately deceptive model. I found at least one example in the past where GPT4 had this problem. It would deliberately lie to the user rather than reveal that it did in fact know the answer to the question:

https://news.ycombinator.com/item?id=36180170

I tried several times with several variants, and always got the same result.

But I just tried again and the problem seems to be fixed now. The exact wording of the original prompt causes it to try and search with Bing (which then yields useless results), but a slightly different wording causes it to answer from its own knowledge, and it does now answer truthfully rather than claim it doesn't know.


That's just the reversal curse: https://www.lesswrong.com/posts/SCqDipWAhZ49JNdmL/paper-llms... There's no deception there.
That's interesting, but regardless of the underlying cause, the effect is deception by any standard: it knows a thing exists, and yet it claims no such thing exists when asked. We don't let humans off the hook if they deceive us but there's an explanation for it.

If it's genuinely due to a weird reasoning failure rather than some social bias it's picked up then that's hopeful, as that would make it in principle fixable.

> We don't let humans off the hook if they deceive us but there's an explanation for it.

Yes, we do. If I ask you to sing the alphabet song backwards instantly, and you have to pause for a few seconds and think and rehearse the alphabet mentally, we don't say 'ah, Mike just tried to deceive us! He refused to answer, pretending he didn't know, but eventually admitted he could sing it backwards after all'. Similarly, I can't imagine why an LLM would 'deceive' us about Tom Cruise's mother. It's simpler to just say that LLMs are not logically omniscient any more than humans are, and they just seem to be worse in this specific example.

> as that would make it in principle fixable.

Yes, see the discussion there: most people think it could be fixed by training with reversed text (or possibly the equivalent, bidirectional losses), but AFAIK it has not been done yet.
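To make "training with reversed text" concrete, here is a minimal sketch of it as a data augmentation step. It assumes word-level reversal, which is only one possible reading (the linked discussion may mean token-level or character-level reversal), and the function names are hypothetical rather than taken from any paper or library.

```python
def reverse_words(text: str) -> str:
    """Reverse the order of whitespace-separated words, leaving each word intact."""
    return " ".join(reversed(text.split()))


def augment_with_reversals(corpus: list[str]) -> list[str]:
    """Return the original documents followed by a word-reversed copy of each."""
    return corpus + [reverse_words(doc) for doc in corpus]


# Assumption: one toy fact standing in for a training corpus.
corpus = ["Gal Gadot's father is Michael Gadot"]
for doc in augment_with_reversals(corpus):
    print(doc)
```

The reversed copy is not a grammatical English sentence; the idea is only that the model also sees the names in the opposite order during training.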

We're not talking about answering more slowly though. Nobody would care if it answered these questions correctly but slower. The problem is it states it doesn't know of any such connection, or even worse, that there's no public information about such a thing. Making up fake details is bad enough, but this is an outright incorrect statement about its own knowledge.

> most people think it could be fixed by training with reversed text

The discussion points out that this would often yield very strange and incorrect statements being fed to the model.

I wonder how much there is to fix. They tested on GPT-3 and Llama1 primarily. The CoT transcript indicates the model may be tripped up by the Tom Cruise example because his mother goes by multiple names, and they admit that GPT4 does sometimes answer correctly anyway. From experimenting, current GPT-4 thinks Cruise's mother is/was called Mary Lee South (married name) and if I do a reversal on that name I get the correct answer all the time (3/3).

I tried a few more from their list uploaded to GitHub. Some cases reverse correctly with GPT-4 even when listed as not reversing in the table. Some cases result in hallucinations. One interesting example listed as not reversing:

Q: Who is the famous daughter of Michael Gadot?

A: Michael Gadot is not widely known as a public figure, and there is no widely recognized daughter of his that fits the description of being famous. It's possible you might be referring to someone else or there might be a confusion in names. If you're thinking of Gal Gadot, the actress known for her role as Wonder Woman, her father's name is Michael Gadot. However, as of my last update in April 2023, Gal Gadot's children are relatively young and not famous in their own right.

This answer suggests the problem isn't a lack of a connection but some sort of confusion about the question itself. If I clarify the question:

Q: Who is the famous daughter of the non-famous parent Michael Gadot?

A: The famous daughter of the non-famous parent Michael Gadot is Gal Gadot. Gal Gadot is an Israeli actress and model, best known for her role as Wonder Woman in the DC Extended Universe.

This works every single time.

I tried another example (James Cameron's father) and once again this modified prompt ensures correct reversal every single time. I wonder to what extent this effect is specific to the examples and models they tested. I do routinely ask reversed questions and it usually answers just fine. Most of my questions are of that form it seems, flicking through the history, which isn't surprising because "I want something with properties A B and C, what are my options" is probably a very common type of query. After all, that's basically a search engine. Hence the surprising nature of its consistent refusal in that one specific case.
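For anyone wanting to repeat this kind of "3/3" check less anecdotally, a rough sketch of a harness follows. Everything here is an assumption for illustration: `ask` is whatever function sends a prompt to your model and returns its reply, and `stub_model` is a hypothetical stand-in that merely mimics the pattern described above, not real GPT-4 output.

```python
from typing import Callable


def reversal_rate(ask: Callable[[str], str], prompt: str,
                  expected: str, trials: int = 3) -> float:
    """Fraction of trials whose reply mentions `expected` (case-insensitive)."""
    hits = sum(expected.lower() in ask(prompt).lower() for _ in range(trials))
    return hits / trials


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for a real chat model call."""
    if "non-famous parent" in prompt:
        return "The famous daughter is Gal Gadot."
    return "There is no widely recognized daughter of his."


plain = "Who is the famous daughter of Michael Gadot?"
clarified = "Who is the famous daughter of the non-famous parent Michael Gadot?"
print(reversal_rate(stub_model, plain, "Gal Gadot"))
print(reversal_rate(stub_model, clarified, "Gal Gadot"))
```

Substring matching is a crude scorer: the real answer quoted above mentions Gal Gadot while still failing to name her as the daughter, so it would be counted as a hit. Replies need to be read by hand or scored with something stricter.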

> We're not talking about answering more slowly though. Nobody would care if it answered these questions correctly but slower.

Yes, we are, because it is a feedforward net with no recurrence or adaptive computation. Its responses are equivalent to your System 1 responses emitted with no time to think. Whereas if it is equipped with mechanisms like retrieval, which would put facts into the context, then that is more analogous to giving you a few seconds to respond before flunking you: you get time to pull things into your short-term & working memory and cogitate on them. And as mentioned, it does fine at answering B-is-A if the relevant A-is-B fact is in its working memory.

> The discussion points out that this would often yield very strange and incorrect statements being fed to the model.

They're not strange if they are fed en masse, nor are they 'incorrect'. A reversed string is not 'wrong'. It is just a reversed string. (If you look at something in a mirror or upside down, it's not 'incorrect'. It's just in a mirror or upside down.)

Hi gwern. The other issue at hand is that inherently information is not always two-way between distributions, so having an implicit bias towards reversal actually can cause quite a few issues as well (though I'm unfortunately still in the 'development stage' of potentially-to-be-published work on this one, so I don't have a ton of details to provide there yet).

I don't think what a lot of people call the reversal curse is as much of an inherent problem as it is an issue of data coverage and assumptions. Reversibility is certainly more "general" in some contexts, but it will also reduce performance in other contexts, at least w.r.t. the source data it's trained on (if that makes sense).

Sorta similar to how grokking is a bit of a fad topic, it is technically unique enough to be identifiable but also at the same time it's just a straightforward 'failure mode' of a relatively general process with a somewhat soft definitional barrier to it.

I read your post, and this is part of the phrasing that I am urging caution against: you are expecting a certain kind of connectivity of information within a non-linear model and attributing a deception hypothesis to it.

Confabulation happens all the time; you might find the split-brain experiments downright fascinating. They open a whole other world of thought on the topic, if you haven't explored it before (and if so, take a look again! It's fantastic).

I think you just created that situation. I am about to reproduce this just using prompts.