Hacker News new | ask | show | jobs
by foundry27 580 days ago
For anyone who hasn’t seen this before, mechanistic interpretability solves a very common problem with LLMs: when you ask a model to explain itself, you’re playing a game of rhetoric where the model tries to “convince” you of a reason for what it did by generating a plausible-sounding answer based on patterns in its training data. But unlike most trends of benchmark numbers getting better as models improve, more powerful models often score worse on tests designed to self-detect “untruthfulness” because they have stronger rhetoric, and are therefore more compelling at justifying lies after the fact. The objective is coherence, not truth.

Rhetoric isn’t reasoning. True explainability, like what overfitted Sparse Autoencoders claim they offer, basically results in the causal sequence of “thoughts” the model went through as it produces an answer. It’s the same way you may have a bunch of ephemeral thoughts in different directions while you think about anything.

6 comments

I want to point out here that people do the same: a lot of the time we don't know why we thought or did something, but we'll confabulate plausible-sounding rhetoric after the fact.
The split-brain experiment is one of my favorites! https://www.youtube.com/watch?v=wfYbgdo8e-8
Not in math.
Yes in math. Formalisms come after casual thoughts, at every step.
It's totally different: those formalisms are in a workbench, following a set of rules that either work or not.

So, yes, that (math) is representative of the actual process: pattern recognition gives you spontaneous ideas, that you assess for truthfulness in conscious acts of verification.

What is a casual thought that you cannot explain in math?
That question makes no sense. You can explain anything in math, because math is a language and lets you define whatever terms and axioms you need at a given moment.

(Whether or not such explanation is useful for anything is another issue entirely.)

Can you explain how intuition led you to try a certain approach?
Math comes from brains.
That's some misunderstanding of the human brain and thought process...
/Some/ people bullshit themselves stating the plausible; others check their hypotheses.

The difference is total in both humans and automated processes.

Everyone, every last one of us, does this every single day, all day, and only occasionally do we deviate to check ourselves, and often then it's to save face.

A Nobel prize was given for related research to Daniel Kahneman.

If you think it doesn't apply to you, you're definitely wrong.

> occasionally

Properly educated people do it regularly, not occasionally. You are describing a definite set of people. No, it does not cover all.

Some people will output a pre-given answer; some people check.

Edit: sniper... Find some argument.

Your decisions shape your preferences just as much as your preferences shape your decisions and you're not even aware of it. Yes, everybody regularly confabulates plausible sounding things that they themselves genuinely believe to be the 'real reason'. You're not immune or special.

https://pmc.ncbi.nlm.nih.gov/articles/PMC3196841/

I will check the article with more attention as soon as I will have the time, but: putting aside a question on how would a similar investigation prove that all people would function in the same way,

that does not seem to counter that some people «check their hypotheses» - as duly. Some people do exercise critical thinking. It is an intentional process.

By the way: I have seldom come across a post so weak.

> every last one of us

And how do you prove it.

> A Nobel prize was given

So?

> If you think, you

Prove it.

Support it, at least. That is not discussion.

How are you going to check your hypotheses for why you preferred that jacket to that other jacket?
Do not lose the original point: some systems have a goal to sound plausible, while some have a goal to say the truth. Some systems, when asked "where have you been", will reply "at the baker's" because it is a nice narrative in their "novel writing, re-writing of reality", some other will check memory and say "at the butcher's", where they have actually been.

When people invent explicit reasons on why they turned left or right, those reasons remain hypotheses. The clumsy will promote those hypotheses to beliefs. The apt will keep the spontaneous ideas as hypotheses, until the ability to assess them comes.

Everybody promotes these sorts of hypotheses to beliefs because it's not a conscious decision you are aware of. It's not about being clumsy or apt. You don't have much control over it.

https://pmc.ncbi.nlm.nih.gov/articles/PMC3196841/

https://pure.uva.nl/ws/files/25987577/Split_Brain.pdf

It does not matter, that there may be a tendency towards bad thinking: what matters is the possibility of proper thinking and the training towards it (becoming more and more proficient at it and practicing it constantly, having it as your natural state; in automation, implementing it in the process).

What you control is the intentional revision of thought.

(I am acquainted with earlier studies about the corpus callosum but I do not know why you would mention that, what it would prove: maybe you could be clearer? I do not see how it could affect the notion of critical thinking.)

Is that example representative for the LLM tasks for which we seek explainability ?
Are we holding LLMs to a higher standard than people?
Ideally yes, LLMs are tools that we expect to work, people are inherently fallible and (even unintentionally) deceptive. LLMs being human-like in this specific way is not desirable.
A{rt,I} imitating life

I believe that's why humans reason too. We make snap judgements and then use reason to try to convince others of our beliefs. Can't recall the reference right now but they argued that it's really a tool for social influence. That also explains why people who are good at it find it hard to admit when they are wrong - they're not used to having to do it because they can usually out argue others. Prominent examples are easy to find - X marks de spot.

I wonder if the reference you are reaching for, if it's not the Jonathan Haidt book suggested by a sibling comment, is The Enigma of Reason by the cognitive psychologists Hugo Mercier and Dan Sperber (2017).

In that book (quoting here from the abstract), Mercier and Sperber argue that reason 'is not geared to solitary use, to arriving at better beliefs and decisions on our own', but rather to 'help us justify our beliefs and actions to others, convince them through argumentation, and evaluate the justifications and arguments that others address to us'. Reason, they suggest, 'helps humans better exploit their uniquely rich social environment'.

They resist the idea (popularized by Daniel Kahneman) that there is 'a contrast between intuition and reasoning as if these were two quite different forms of inference', proposing instead that 'reasoning is itself a kind of intuitive inference'. For them, reason as a cognitive mechanism is 'much more opportunistic and eclectic' than is implied by the common association with formal systems like logic. 'The main role of logic in reasoning, we suggest, may well be a rhetorical one: logic helps simplify and schematize intuitive arguments, highlighting and often exaggerating their force.'

Their 'interactionist' perspective helps explain how illogical rhetoric can be so socially powerful; it is reason, 'a cognitive mechanism aimed at justifying oneself and convincing others', fulfilling its evolutionary social function.

Highly recommended, if you're not already familiar.

Thank you. That's exactly the idea and described much more eloquently. I probably heard it through the Sapolsky lecture from a sibling comment but that captures it exactly. Bookmarked.
I think Robert Sapolsky's lectures on yt cover this to some degree around 115.

https://youtu.be/wLE71i4JJiM?feature=shared

Sometimes our cortex is in charge, sometimes other parts of our brain are, and we can't tell the difference. Regardless, if we try to justify it later, that justification isn't always coherent because we're not always using the part of our brain we consider to be rational.

Yes that was probably it because I rewatched that recently. Thanks!
People who are good at reasoning find it hard to admit that they were wrong?

That’s not my experience. People with reason are.. reasonable.

You mention X and that’s not where the reasoners are. That’s where the (wanna be) politicians are. Rhetoric is not all of reasoning.

I can agree that rationalizing snap judgements is one of our capabilities but I am totally unconvinced that it is the totality of our reasoning capabilities. Perhaps I misunderstood.

This is not totally my experience, I've debated a successful engineer who by all accounts has good reasoning skills, but he will absolutely double down on unreasonable ideas he's made on the fly he if can find what he considers a coherent argument behind them. Sometimes if I absolutely can prove him wrong he'll change his mind.

But I think this is ego getting in the way, and our reluctance to change our minds.

We like to point to artificial intelligence and explain how it works differently and then say therefore it's not "true reasoning". I'm not sure that's a good conclusion. We should look at the output and decide. As flawed as it is, I think it's rather impressive

> ego getting in the way

That thing which was in fact identified thousands of years ago as the evil to ditch.

> reluctance to change our minds

That is clumsiness in a general drive that makes sense and is recognized part of the Belief Change Theory: epistemic change is conservative. I.e., when you revise a body of knowledge you do not want to lose valid notions. But conversely, you do not want to be unable to see change or errors, so there is a balance.

> it's not "true reasoning"

If it shows not to explicitly check its "spontaneous" ideas, then it is a correct formula to say 'it's not "true reasoning"'.

> then it is a correct formula to say 'it's not "true reasoning"'

why is that point fundamental?

Because the same way you do not want a human interlocutor to speak out of its dreams, uttering the first ideas that come to mind unvetted, and you want him to instead have thought hard and long and properly and diligently and well, equally you'll want the same from an automation.
The smarter a person is, the better they are at rationalizing their decisions. Especially the really stupid decisions.
People with reason ... sound reasonable.

I think some prominent people on X who are good at reasoning from First Principles will double down on things rather than admit their mistake.

The other very prominent psychological phenomenon I have observed in the world is "Projection", i.e. the phenomenon of seeing qualities in other people that we have ourselves. I guess it is because we think others would do what we would do ourselves. Trump is a clear example of this - whatever he accuses someone else off, you know he is doing. Point here being that this doubling down on bad reasons in order to not admit my mistakes is something I've observed in myself. Reason does indeed help me to try and overcome it when I recognise it but the tricky part is being able to recognise it.

Already before Galileo we had experiments to determine whether ideas represented reality or not. And in crucial cases, long before that, it meant life or death. This will be clear to engineers.

«Reason» is part of that mechanism of vetting ideas. You experience massive failures without it.

So, no, trained judgement is a real thing, and the presence of innumerable incompetent do not prove an alleged absence of the competent.

Jonathan Haidt's The Righteous Mind describes this ín details.
Thanks
A lot of the mech interp stuff has seemed to me like a different kind of voodoo: the Integer Quantum Hall Effect? Overloading the term “Superposition” in a weird analogy not governed by serious group representation theory and some clear symmetry? You guys are reaching. And I’ve read all the papers. Spot the postdoc who decided to get paid.

But there is one thing in particular that I’ll acknowledge as a great insight and the beginnings of a very plausible research agenda: bounded near orthogonal vector spaces are wildly counterintuitive in high dimensions and there are existing results around it that create scope for rigor [1].

[1] https://en.m.wikipedia.org/wiki/Johnson%E2%80%93Lindenstraus...

Superposition code is a well known concept in information theory - I think there is certainly more to the story then described in the current works, but it does feel like they are going in the right direction
Where are you seeing the integer quantum Hall effect mentioned? Or are you bringing it up rather than responding to it being brought up elsewhere? I don’t understand what the connection between IQHE and these SAE interpretability approaches is supposed to be.
Pardon me, the reference is to the fractional Hall effect.

"But our results may also be of broader interest. We find preliminary evidence that superposition may be linked to adversarial examples and grokking, and might also suggest a theory for the performance of mixture of experts models. More broadly, the toy model we investigate has unexpectedly rich structure, exhibiting phase changes, a geometric structure based on uniform polytopes, "energy level"-like jumps during training, and a phenomenon which is qualitatively similar to the fractional quantum Hall effect in physics, among other striking phenomena. We originally investigated the subject to gain understanding of cleanly-interpretable neurons in larger models, but we've found these toy models to be surprisingly interesting in their own right."

https://transformer-circuits.pub/2022/toy_model/index.html

BTW, it's easy to test model's logic and truthfulness by giving it a wrong decision is if it was its, and asking to explain. Model has no memory and cannot distinguish the source of the text. 'Truthful' model should admit mistake without being asked. Likely model instead will do 'parallel construction' to support 'its' decision.
How does the causality part work? Can it spit out a graphical model?
I stopped at: "causal sequence of “thoughts” "
Interpretability research is basically a projection of the original function implemented by the neural network onto a sub-space of "explanatory" functions that people consider to be more understandable. You're right that the words they use to sell the research is completely nonsensical because the abstract process has nothing to do with anything causal.
All code is causal.
Which makes it entirely irrelevant as a descriptive term.
"Servers shall be strict in formulation and flexible in interpretation."