HN Mirror


by embedding-shape 68 days ago

> Certainly not from interpretability research

What research shows that you can ask ChatGPT to explain its reasoning and why it said what it said, and that's guaranteed to actually be the motivation?

I've seen plenty of experiments that look at various things inside the black box while inference is happening, but I've never seen any research showing that tokens can explain why other tokens are there. I'd be very happy to be educated here if you have any resources at hand; I won't claim to know everything.


famouswaffles 68 days ago

> What research shows that you can ask ChatGPT to explain its reasoning and why it said what it said, and that's guaranteed to actually be the motivation?

What research shows that you can ask a human to explain their reasoning and why they said what they said, and that's guaranteed to actually be the motivation? There's no such thing. If anything, the research that exists suggests any explanation we give is a post-hoc rationalization, even if the human thinks otherwise.

https://transformer-circuits.pub/2025/introspection/index.ht...


embedding-shape 68 days ago

Why not try to answer my question, instead of asking a different question which I haven't even claimed to have the answer to?


famouswaffles 68 days ago

I did answer it, albeit not directly. "Guaranteed to be the motivation" isn't a standard anyone can meet, and so framing it that way doesn't really probe anything meaningful about LLMs specifically. If what you want to hear is No, then sure, have your No, but it doesn't mean anything. There's just not much to the question.

Even though you held it up as one born of a greater understanding of LLMs, the interpretability research we have so far, and our current very little understanding of the internal computations of these models does not support your position, and certainly not how assured you are about it.


embedding-shape 67 days ago

> our current very little understanding of the internal computations of these models does not support your position

Our current understanding is sufficient to know that you cannot ask an LLM to explain its behavior and expect it to do so correctly. I'm not sure what research you've read that led you to believe this could be possible in the first place, but I'm happy to receive links to read through if you're sitting on them.


famouswaffles 67 days ago

Explanations can be faithful sometimes. That's the standard we can expect for any intelligence as far as we're aware.

https://arxiv.org/abs/2504.14150
