by yeck
1035 days ago
I have a hard time understanding why mechanistic interpretability has so few eyes on it. It's like trying to build a complex software system without logging or monitoring. Any other improvements you want to make to the system are just going to be trial and error plus luck. Hallucination is one problem where interpretability might be able to identify the failure modes we need to address. Really, any AI problem could likely be aided by a scalable approach to interpretability that feels just as mundane as classical software observability.
|
To truly eliminate hallucinations, I would think you'd have to change the initial training phase. Rather than only feeding raw text and predicting next tokens, you'd need to feed propositions labeled with some probability that they are actually true. Doing this with real fidelity is clearly not possible. No one has a database of all fact claims quantified by probability of truth. But you could potentially use the same heuristics used by human learners and impart some encoding of a hierarchy of evidence. Give high weight to claims made by professional scientific organizations, high but somewhat lesser weight to conclusions of large-scale meta-analyses in relatively mechanistic fields, and very low weight to comments on Reddit.
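A rough sketch of what that weighting could look like, purely as an illustration: scale each training example's next-token loss by a reliability weight for its source. The source categories, the weight values, and the `weighted_nll` helper are all made up for this example, not anything a real training pipeline is known to use.

```python
import math

# Hypothetical reliability weights per source type (illustrative values only).
SOURCE_WEIGHT = {
    "scientific_organization": 1.0,
    "meta_analysis": 0.8,
    "reddit_comment": 0.1,
}

def weighted_nll(examples):
    """Reliability-weighted mean negative log-likelihood per token.

    Each example is (source_type, token_probs), where token_probs holds the
    probability the model assigned to each correct next token.
    """
    total_loss = 0.0
    total_weight = 0.0
    for source_type, token_probs in examples:
        w = SOURCE_WEIGHT[source_type]
        nll = -sum(math.log(p) for p in token_probs)
        total_loss += w * nll
        total_weight += w * len(token_probs)
    return total_loss / total_weight

# Mispredicting a low-weight source costs far less than mispredicting a high-weight one.
low_trust_miss = weighted_nll([("scientific_organization", [0.9]), ("reddit_comment", [0.1])])
high_trust_miss = weighted_nll([("scientific_organization", [0.1]), ("reddit_comment", [0.9])])
print(low_trust_miss < high_trust_miss)
```

The hard part is not the arithmetic but deciding who assigns the labels and weights in the first place.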
That is all entirely possible, but the manual human labor required seems antithetical to the business goals of anyone actually doing this kind of research. Without it, though, you're seemingly limited to either playing whack-a-mole, fine-tuning away specific classes of error as they're caught, or relying on the dubious assumption that the plausibly human-generated utterances you're trying to mimic are sufficiently more likely to be true than false.
This problem arguably goes away if people treat LLMs for what they are, generators of strings that look like plausible human-generated utterances, rather than generators of fact claims likely to be true. But if we really want strong AI, we clearly need the latter. There is a reason epistemologists have long defined knowledge as justified true belief, not just incidentally lucking into being correct.