|
For anyone who hasn’t seen this before, mechanistic interpretability solves a very common problem with LLMs: when you ask a model to explain itself, you’re playing a game of rhetoric where the model tries to “convince” you of a reason for what it did by generating a plausible-sounding answer based on patterns in its training data. But unlike most trends of benchmark numbers getting better as models improve, more powerful models often score worse on tests designed to self-detect “untruthfulness” because they have stronger rhetoric, and are therefore more compelling at justifying lies after the fact. The objective is coherence, not truth. Rhetoric isn’t reasoning. True explainability, like what overfitted Sparse Autoencoders claim they offer, basically results in the causal sequence of “thoughts” the model went through as it produces an answer. It’s the same way you may have a bunch of ephemeral thoughts in different directions while you think about anything. |