by mike_hearn 889 days ago
That's interesting but regardless of the underlying cause the effect is deception by any standard: it knows a thing exists, and yet it claims no such thing exists when asked. We don't let humans off the hook if they deceive us but there's an explanation for it.

If it's genuinely due to a weird reasoning failure rather than some social bias it's picked up, then that's hopeful, as that would make it in principle fixable.


> We don't let humans off the hook if they deceive us but there's an explanation for it.

Yes, we do. If I ask you to sing the alphabet song backwards instantly, and you have to pause for a few seconds and think and rehearse the alphabet mentally, we don't say 'ah, Mike just tried to deceive us! He refused to answer, pretending he didn't know, but eventually admitted he could sing it backwards after all'. Similarly, I can't imagine why an LLM would 'deceive' us about Tom Cruise's mother. It's simpler to just say that LLMs are not logically omniscient any more than humans are, and they just seem to be worse in this specific example.

> as that would make it in principle fixable.

Yes, see the discussion there: most people think it could be fixed by training with reversed text (or possibly the equivalent, bidirectional losses), but AFAIK it has not been done yet.
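To make "training with reversed text" concrete, here is a minimal sketch of word-level reversal as a data augmentation step. It is only an illustration: `reverse_words` and `augment_with_reversals` are hypothetical names, and whether a real run would reverse at the word, token, or entity level is an assumption, not something settled in the discussion.

```python
def reverse_words(text: str) -> str:
    """Return the text with its whitespace-separated words in reverse order."""
    return " ".join(reversed(text.split()))


def augment_with_reversals(corpus: list[str]) -> list[str]:
    """Return each original document followed by its word-reversed copy."""
    augmented = []
    for doc in corpus:
        augmented.append(doc)
        augmented.append(reverse_words(doc))  # assumption: word-level reversal
    return augmented


if __name__ == "__main__":
    corpus = ["Tom Cruise's mother is Mary Lee South"]
    for line in augment_with_reversals(corpus):
        print(line)
```

The reversed copy is not a grammatical sentence, but the name now appears before "Tom Cruise" in the token stream, which is the ordering the forward-only training data lacks.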

We're not talking about answering more slowly though. Nobody would care if it answered these questions correctly but slower. The problem is it states it doesn't know of any such connection, or even worse, that there's no public information about such a thing. Making up fake details is bad enough, but this is an outright incorrect statement about its own knowledge.

> most people think it could be fixed by training with reversed text

The discussion points out that this would often yield very strange and incorrect statements being fed to the model.

I wonder how much there is to fix. They tested on GPT-3 and Llama 1 primarily. The CoT transcript indicates the model may be tripped up by the Tom Cruise example because his mother goes by multiple names, and they admit that GPT-4 does sometimes answer correctly anyway. From experimenting, current GPT-4 thinks Cruise's mother is/was called Mary Lee South (married name) and if I do a reversal on that name I get the correct answer all the time (3/3).

I tried a few more from their list uploaded to GitHub. Some cases reverse correctly with GPT-4 even when listed as not reversing in the table. Some cases result in hallucinations. One interesting example listed as not reversing:

Q: Who is the famous daughter of Michael Gadot?

A: Michael Gadot is not widely known as a public figure, and there is no widely recognized daughter of his that fits the description of being famous. It's possible you might be referring to someone else or there might be a confusion in names. If you're thinking of Gal Gadot, the actress known for her role as Wonder Woman, her father's name is Michael Gadot. However, as of my last update in April 2023, Gal Gadot's children are relatively young and not famous in their own right.

This answer suggests the problem isn't a lack of a connection but some sort of confusion about the question itself. If I clarify the question:

Q: Who is the famous daughter of the non-famous parent Michael Gadot?

A: The famous daughter of the non-famous parent Michael Gadot is Gal Gadot. Gal Gadot is an Israeli actress and model, best known for her role as Wonder Woman in the DC Extended Universe.

This works every single time.

I tried another example (James Cameron's father) and once again this modified prompt ensures correct reversal every single time. I wonder to what extent this effect is specific to the examples and models they tested. I do routinely ask reversed questions and it usually answers just fine. Flicking through my history, most of my questions seem to be of that form, which isn't surprising because "I want something with properties A, B and C, what are my options" is probably a very common type of query. After all, that's basically a search engine. Hence the surprising nature of its consistent refusal in that one specific case.

> We're not talking about answering more slowly though. Nobody would care if it answered these questions correctly but slower.

Yes, we are, because it is a feedforward net with no recurrence or adaptive computation. Its responses are equivalent to your System 1 responses emitted with no time to think. Whereas if it is equipped with mechanisms like retrieval, which would put facts into the context, then that is more analogous to giving you a few seconds to respond before flunking you: you get time to pull things into your short-term & working memory and cogitate on them. And as mentioned, it does fine at answering B-is-A if the relevant A-is-B fact is in its working memory.
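A toy sketch of what "retrieval puts facts into the context" means in practice. Everything here is a made-up illustration: the `FACTS` list, the capitalized-word matching in `names_in`, and the prompt layout are assumptions, not how any particular retrieval system works.

```python
import re

# Hypothetical fact store; each fact is stated in the "A is B" direction.
FACTS = [
    "Gal Gadot's father is Michael Gadot.",
    "Tom Cruise's mother is Mary Lee South.",
]


def names_in(text: str) -> set[str]:
    """Crude stand-in for entity matching: the set of capitalized words."""
    return set(re.findall(r"\b[A-Z][a-z]+\b", text))


def retrieve(question: str, facts: list[str]) -> list[str]:
    """Return the facts that share at least one capitalized word with the question."""
    wanted = names_in(question)
    return [fact for fact in facts if wanted & names_in(fact)]


def build_prompt(question: str, facts: list[str]) -> str:
    """Prepend the retrieved facts so the A-is-B statement sits in the context window."""
    context = "\n".join(retrieve(question, facts))
    return f"Context:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("Who is the famous daughter of Michael Gadot?", FACTS))
```

With the forward fact sitting in the prompt, the reversed question becomes a lookup over the context rather than a recall from the weights.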

> The discussion points out that this would often yield very strange and incorrect statements being fed to the model.

They're not strange if they are fed en masse, nor are they 'incorrect'. A reversed string is not 'wrong'. It is just a reversed string. (If you look at something in a mirror or upside down, it's not 'incorrect'. It's just in a mirror or upside down.)