by yeck 1035 days ago
I have a hard time understanding why mechanistic interpretability has so few eyes on it. It's like trying to build a complex software system without logging or monitoring. Any other improvements you want to make to the system are just going to be trial and error, with some luck. The hallucination problem is one where interpretability of a model might be able to identify the failure modes that we need to address. Really, any AI problem could likely be aided by a scalable approach to interpretability that feels just as mundane as classical software observability.

I'm going to talk out of my ass here because I am not involved enough to know the mechanics of how LLMs are really trained at any deep level, but from the surface-level understanding I have, I would expect any attempt to eliminate hallucination to be intractable given the techniques in use. As far as I understand, the initial training run is simply fed raw text and works on the basis of predicting the next token. Then the models are fine-tuned using RLHF and potentially other techniques I don't know much about.

To truly eliminate hallucinations, I would think you'd have to change the initial training phase. Rather than only feeding raw text and predicting next tokens, you'd need to feed propositions labeled with some probability that they are actually true. Doing this with real fidelity is clearly not possible. No one has a database of all fact claims quantified by probability of truth. But you could potentially use the same heuristics used by human learners and impart some encoding of hierarchy of evidence. Give high weight to claims made by professional scientific organizations, high but somewhat lesser to conclusions of large-scale meta-analyses in relatively mechanistic fields, give very low weight to comments on Reddit.
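
To make the weighting idea concrete, here is a minimal sketch, not how any lab actually trains. It assumes each training example comes tagged with a source type and scales the usual next-token loss by a reliability weight. The source categories, the weight values, and the function names are all made up for illustration.

```python
import math

# Hypothetical reliability weights per source type (illustrative values only).
SOURCE_WEIGHTS = {
    "scientific_organization": 1.0,
    "meta_analysis": 0.8,
    "reddit_comment": 0.1,
}


def token_nll(probs, target):
    """Negative log-likelihood of the target token under a predicted distribution."""
    return -math.log(probs[target])


def weighted_loss(examples):
    """Weighted average of per-token losses.

    Each example is (source_type, [(predicted_probs, target_token), ...]).
    Tokens from low-weight sources contribute less to the training signal.
    """
    total, norm = 0.0, 0.0
    for source, steps in examples:
        w = SOURCE_WEIGHTS[source]
        for probs, target in steps:
            total += w * token_nll(probs, target)
            norm += w
    return total / norm


# Usage: the same kind of prediction error counts for far less
# when the text came from a low-weight source.
examples = [
    ("scientific_organization", [({"a": 0.5, "b": 0.5}, "a")]),
    ("reddit_comment", [({"a": 0.25, "b": 0.75}, "a")]),
]
print(weighted_loss(examples))
```

Note this only down-weights unreliable text; it does not give the model any notion of a claim's probability of being true, which is the harder part described above.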

That is all entirely possible but the manual human labor required seems antithetical to the business goals of anyone actually doing this kind of research. Without it, though, you're seemingly limited to either playing whack-a-mole with fine tuning out specific classes of error when they're caught or relying on a dubious assumption that plausibly human-generated utterances you're trying to mimic are sufficiently more likely to be true than false.

This problem arguably goes away if people treat LLMs for what they are, generators of strings that look like plausible human-generated utterances, rather than generators of fact claims likely to be true. But if we really want strong AI, we clearly need the latter. There is a reason epistemologists have long defined knowledge as justified true belief, not just incidentally lucking into being correct.

If you could know that this is the case with interpretability tools, then we would be able to train new models with purposeful decisions to reduce or remove hallucinations. It would narrow the range of tests and experiments you need to do to solve the problem. Otherwise we are mostly speculating about why stuff doesn't work and playing a game of darts in the dark.
When I looked into this briefly my impression was that it's extremely hard to do mechanistic interpretation beyond very simple cases like CNN classification or toy problems like arithmetic in transformers. Not to say it's not a worthy pursuit, but I think the difficulty isn't justified for many researchers since the results won't make a big splash like a new model training result.
Yeah, it is harder than other things, but if we can train a model to explain collections of pixels in human language, then we might be able to do something similar with collections of activations.

I don't know if that is the direction, but just an example that comes to mind easily.
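
A much simpler cousin of that idea is a linear probe: train a small classifier to read one property off a model's activations. The sketch below is only an illustration under heavy assumptions. The "activations" are synthetic random vectors with a signal planted in one dimension, not taken from any real model, and every function name is made up.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def make_synthetic_activations(n, seed=0):
    """Fake 4-dimensional 'activations'; the label is encoded in dimension 2."""
    rng = random.Random(seed)
    data, labels = [], []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = [rng.gauss(0.0, 1.0) for _ in range(4)]
        x[2] += 3.0 if y else -3.0  # planted feature (assumption)
        data.append(x)
        labels.append(y)
    return data, labels


def train_probe(activations, labels, epochs=50, lr=0.1):
    """Logistic regression trained with plain stochastic gradient descent."""
    w = [0.0] * len(activations[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(activations, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            g = p - y  # gradient of the log loss with respect to the logit
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def predict(w, b, x):
    """Predict label 1 when the probe's logit is positive, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


train_x, train_y = make_synthetic_activations(200, seed=0)
test_x, test_y = make_synthetic_activations(100, seed=1)
w, b = train_probe(train_x, train_y)
accuracy = sum(predict(w, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(accuracy, w)
```

If the probe does well, the learned weights point at where the property lives in the activations. That is a long way from explaining activations in human language, but it is the same "read meaning out of numbers" move.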

If someone figures out how to do this, I think their models will be far more capable and reliable.