by EncomLab 529 days ago
We should stop using the term "black box" to mean "we don't know" when really it's "we could find out but it would be really hard".

We can precisely determine the exact state of any digital system and track that state as it changes. In something as large as an LLM, doing so is extremely complex, but complex does not mean unknowable.

These systems are still just software, with pre-defined operations executing in order like any other piece of software. A CPU does not enter some mysterious woo "LLM black box" state that is somehow fundamentally different than running any other software, and it's these imprecise terms that lead to so much of the hype.

4 comments

The usual use of the term "black box" is just that you are using/testing a system without knowing/assuming anything about what's inside. It doesn't imply that what's inside is complex or unknown - just unknown to an outside observer who can only see the box.

e.g.

In "black box" testing of a system you are just going to test based on the specifications of what the output/behavior should be for a given input. In contrast, in "white box" testing you leverage your knowledge of the internals of the box to test for things like edge cases that are apparent in the implementation, to test all code paths, etc.
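To make the distinction concrete, here is a minimal sketch in Python. The function `shipping_cost` and its spec ("flat 10 up to 5 kg, then 2 per extra kg") are hypothetical, made up purely for illustration. The black-box tests use only the spec; the white-box tests are chosen by reading the implementation and targeting its branches and boundaries.

```python
def shipping_cost(weight_kg: float) -> float:
    """Hypothetical function under test: flat rate up to 5 kg, then per-kg."""
    if weight_kg <= 0:
        raise ValueError("weight must be positive")
    if weight_kg <= 5:
        return 10.0
    return 10.0 + 2.0 * (weight_kg - 5)


def black_box_tests() -> None:
    # Derived only from the spec; no knowledge of how the function is written.
    assert shipping_cost(1) == 10.0
    assert shipping_cost(7) == 14.0


def white_box_tests() -> None:
    # Derived from reading the code: exercise each branch and the boundary at 5.
    assert shipping_cost(5) == 10.0
    assert shipping_cost(5.5) == 11.0
    try:
        shipping_cost(0)
    except ValueError:
        pass
    else:
        raise AssertionError("expected ValueError for non-positive weight")


black_box_tests()
white_box_tests()
```

Both suites test the same box; the only difference is what the tester is allowed to know when choosing the inputs.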

Yes, that is the definition - but that is not what is occurring here. We DO know exactly what is going on inside the system and can determine precisely from step to step the state of the entire system and the next state of the system. The author is making a claim based on woo that somehow this software operates differently than any other software at a fundamental level and that is not the case.

Are they? The article only mentions "black box" a couple of times, and seems to be using it in the sense of "we don't need to be concerned about what's inside".

In any case, while we know there's a transformer in the box, the operational behavior of a trained transformer is still somewhat opaque. We know the data flow of course, and how to calculate next state given current state, but what is going on semantically - the field of mechanistic interpretability - is still a work in progress.
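A toy sketch of that gap, in Python. Everything here is an assumption for illustration: the weights, the three-token vocabulary, and the single linear layer are made up and bear no relation to a real transformer. The point is only that every intermediate value can be logged and replayed exactly, while the logged numbers do not explain themselves.

```python
import math

# Hypothetical toy "model": fixed weights mapping a 2-d state to 3 token scores.
WEIGHTS = [[0.5, -1.0], [1.5, 0.25], [-0.75, 2.0]]
VOCAB = ["a", "b", "c"]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def next_token(state, trace):
    """Compute the next token greedily and record every intermediate value."""
    scores = [sum(w * x for w, x in zip(row, state)) for row in WEIGHTS]
    probs = softmax(scores)
    choice = max(range(len(probs)), key=probs.__getitem__)
    trace.append({"state": list(state), "scores": scores,
                  "probs": probs, "token": VOCAB[choice]})
    return VOCAB[choice]


def run(state):
    """Run one step and return the chosen token plus the full trace."""
    trace = []
    token = next_token(state, trace)
    return token, trace


token_1, trace_1 = run([1.0, 2.0])
token_2, trace_2 = run([1.0, 2.0])
# Same input, same weights: the two traces are identical, step for step.
print(token_1, trace_1 == trace_2)
```

The trace tells you exactly what was computed. It does not tell you what, if anything, the second row of the weights "means", which is the question interpretability work is trying to answer at scale.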

Something like: A black box is unknowable, a gray box can be figured out in principle, a white box is fully known. A pocket calculator is fully known. LLMs are (dark) gray boxes - we can, in principle, figure out any particular sequence of computations, at any particular level you want to look at, but doing so is extremely tedious. Tools are being researched and developed to make this better, and mechinterp makes progress every day.

However, even if, in principle, you could figure out any particular sequence of reasoning done by a model, it might in effect be "secured" and out of reach of current tools, in the same sense that encryption puts a brute-force key search out of reach of current computers. A 40-bit key might have been considered adequate 30 years ago but falls to brute force quickly now, while a 128-bit key would take far longer than the age of the universe to brute force on current hardware.
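A rough back-of-envelope for how brute-force time scales with key length, in Python. The guess rate of 10^12 keys per second is an assumption picked for illustration, not a measurement of any real hardware, and the calculation covers the worst case of trying every key.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600
GUESSES_PER_SECOND = 1e12        # assumed rate, purely illustrative
AGE_OF_UNIVERSE_YEARS = 1.38e10  # approximate


def brute_force_seconds(key_bits: int, rate: float = GUESSES_PER_SECOND) -> float:
    """Worst-case time in seconds to try every key of the given length."""
    return (2 ** key_bits) / rate


def brute_force_years(key_bits: int, rate: float = GUESSES_PER_SECOND) -> float:
    """Same worst-case time, expressed in years."""
    return brute_force_seconds(key_bits, rate) / SECONDS_PER_YEAR


for bits in (40, 64, 128):
    years = brute_force_years(bits)
    print(f"{bits:>3} bits: {brute_force_seconds(bits):.3g} s "
          f"({years:.3g} years, {years / AGE_OF_UNIVERSE_YEARS:.3g} x age of universe)")
```

Each extra bit doubles the work, so a search that is trivial at one key size becomes unreachable a few dozen bits later. The analogy in the comment is that a model's reasoning could be "in principle traceable" in the same way a key is "in principle guessable".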

There could also be, and very likely are, sequences of processing/machine reasoning that don't make sense in terms of the way humans think. You might have every relevant step decomposed in a particular generation of text, and it might not provide any insight into how or why the text was produced, with regard to everything else you know about the model.

A challenge for AI researchers is broadly generalizing the methodologies and theories such that they apply to models beyond those with the particular architectures and constraints being studied. If an experiment can work with a diffusion model as well as it does with a pure text model, and produces robust results for any model tested, the methodology works, and could likely be applied to human minds. Each of these steps takes us closer to understanding a grand unifying theory of intelligence.

There are probably some major breakthroughs in explainability and generative architectures that will radically alter how we test and study and perform research on models. Things like sparse autoencoders (SAEs) and Golden Gate Claude might only be hyperspecific investigations of how models work with this particular type of architecture.

All of that to say, we might only ever get to a "pale gray box" level of understanding of some types of model, and never, in principle, to a perfectly understood intelligent system, especially if AI reaches the point of recursive self improvement.

One important point (I think) is whether the cause or outcome of the box can be understood or predicted without full emulation of the entire box. Can it be distilled down to a simpler set of rules, or is it a chaotic system that turns into a different system if any part of it is removed?

That is, can you trace unequivocally the reason an LLM produced a certain token without, in effect, recreating the LLM and asking it the same question again?

This is much more similar to the technique of obfuscating encryption algorithms for DRM schemes that I believe is often called "white-box cryptography".

So going by your definition what would be a true black box?

A starting point would be a system that does not require the use of a limited set of pre-defined operations to transform from one state to another state via the interpretation of a set of pre-existing instructions. This rules out any digital system entirely.

But what _would_ qualify? The point being made is that your definition is so constricting as to be useless. Nothing (sans perhaps true physical limit-conditions, like black holes) would be a black box under your definition.

It's really only constricting to state machines which are dependent upon a fixed instruction set to function.