Hacker News
by observationist 529 days ago
Something like: A black box is unknowable, a gray box can be figured out in principle, a white box is fully known. A pocket calculator is a white box. LLMs are (dark) gray boxes - we can, in principle, trace any particular sequence of computations, at whatever level you want to look at, but doing so is extremely tedious. Tools are being researched and developed to make this easier, and mechanistic interpretability (mechinterp) is making steady progress.

However - even if, in principle, you could figure out any particular sequence of reasoning done by a model, it might in effect be "secured" and out of reach of current tools, in the same sense that encryption puts a brute-force key search out of reach of current computers. A 56-bit key was considered adequate decades ago and can now be brute-forced within days or less on specialized hardware, but a 128-bit key is still far out of reach, and a 256-bit key would take vastly longer than the age of the universe to brute force on current hardware.
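To put rough numbers on the brute-force analogy, here is a back-of-the-envelope sketch. The guess rate of 1e12 keys per second is an assumption picked for illustration, not a measurement of any real hardware, and the age of the universe is taken as roughly 1.38e10 years.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate

def brute_force_years(key_bits, guesses_per_second=1e12):
    """Worst-case time, in years, to try every key of the given size.

    guesses_per_second is a hypothetical rate chosen for illustration.
    The average case is half of this figure.
    """
    keyspace = 2 ** key_bits
    return keyspace / guesses_per_second / SECONDS_PER_YEAR

for bits in (40, 56, 128, 256):
    years = brute_force_years(bits)
    ratio = years / AGE_OF_UNIVERSE_YEARS
    print(f"{bits:>3}-bit key: {years:.3g} years ({ratio:.3g}x the age of the universe)")
```

Each extra bit doubles the work, so the gap between "trivial" and "never" is only a few dozen bits wide. The worry is that some model computations could sit on the wrong side of a similar gap for interpretability tools.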

There could also be, and very likely are, sequences of processing or machine reasoning that don't correspond to anything in the way humans think. You might have every relevant step in a particular generation of text decomposed, and it still might not provide any insight into how or why the text was produced, given everything else you know about the model.

A challenge for AI researchers is generalizing the methodologies and theories so that they apply to models beyond those with the particular architectures and constraints being studied. If an experiment works with a diffusion model as well as it does with a pure text model, and produces robust results for any model tested, the methodology is probably sound, and might eventually be applied to human minds. Each of these steps takes us closer to a grand unified theory of intelligence.

There are probably some major breakthroughs still to come in explainability and generative architectures that will radically alter how we test, study, and perform research on models. Things like sparse autoencoders (SAEs) and Golden Gate Claude might turn out to be hyperspecific investigations of how models work with this particular type of architecture.

All of that to say, we might only ever get to a "pale gray box" level of understanding of some types of model, and never to a perfectly understood intelligent system, especially if AI reaches the point of recursive self-improvement.

1 comment

One important point (I think) is whether the behavior of the box can be understood or predicted without fully emulating the entire box. Can it be distilled down to a simpler set of rules, or is it a chaotic system that turns into a different system if any part of it is removed?

That is, can you trace unequivocally the reason an LLM produced a certain token without, in effect, recreating the LLM and asking it the same question again?
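A toy sketch of what "removing a part" looks like in practice, assuming a hand-made network with one hidden layer. The weights, the input, and the `forward` helper are all hypothetical and not taken from any real model. It records every intermediate value, then zeroes one hidden unit at a time (an ablation) to see how much the output depends on it.

```python
import math

# Hypothetical toy weights, chosen by hand for illustration only.
W1 = [[0.5, -1.0], [1.5, 0.25], [-0.75, 0.5]]  # 3 hidden units, 2 inputs
W2 = [1.0, -2.0, 0.5]                          # linear readout to 1 output

def forward(x, ablate=None):
    """Run the toy network and record every intermediate value.

    If `ablate` is a hidden-unit index, that unit's activation is zeroed.
    """
    pre = [sum(w * xi for w, xi in zip(row, x)) for row in W1]
    hidden = [math.tanh(p) for p in pre]
    if ablate is not None:
        hidden[ablate] = 0.0
    out = sum(w * h for w, h in zip(W2, hidden))
    return out, {"pre": pre, "hidden": hidden}

x = [1.0, 2.0]
baseline, trace = forward(x)

# Effect of each hidden unit: how far the output moves when it is removed.
effects = [baseline - forward(x, ablate=i)[0] for i in range(len(W2))]

print("baseline output:", baseline)
print("hidden activations:", trace["hidden"])
print("per-unit effects:", effects)
```

In this tiny case the per-unit effects add up exactly to the output, because the readout is linear, so the box does distill into simple rules. In a deep network, removing a unit changes what every later layer computes, the effects stop adding up, and getting the trace at all still means running the forward pass, which is the situation the question above is pointing at.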