Hacker News
LLMs are not the black box you were promised (jay.ai)
56 points by _jayhack_ 13 days ago
11 comments

LLM written article. It's also not accurate; the fact that language models have human-interpretable representations and neurons has been known since BERT.

Circuits research also does not come from Anthropic. Mech interp is a huge field in academia and most of the core circuit analysis papers were from OpenAI/GDM/academia. However, Anthropic tends to produce a lot of blog posts where they draw poorly supported but hype-able analogies between LLMs and biological intelligence. It's wild.

For a better understanding of mech interp and circuits, including what we actually do know about LLM internals, I would recommend reading this paper: https://arxiv.org/pdf/2501.16496

Hello, I am the author - this is not an LLM-generated article, I wrote this by hand and had an LLM adapt it from a thread on X. You can see the original thread here: https://x.com/mathemagic1an/status/2035850046735098065

> the fact that language models have human-interpretable representations and neurons has been known since BERT... Circuits research also does not come from Anthropic... The article does not claim Anthropic invented the field, rather that they have had important contributions to it. This is intended as an overview into a specific set of ideas that are working for mechanistic interpretability. Not a formal literature review.

I don't recall being promised a black box. Are we certain LLMs didn't write this article and just come up with one of those "It's Not Whatever-random-thing" pithy zingers that they're prone to?
Deep Learning is a Black Box and Here's Why You Should Use Random Forests Because They Are Interpretable

This was the mantra of applied machine learning c. 2010 - 2024 for anyone paying attention. No longer the case.

If you had asked me last year (2025), I would still have said LLMs are a silly toy.

As of Jan 2026 I have come to accept that LLMs are at least part of the puzzle of how intelligence works. They are at this point better than the majority of humans at various intellectual tasks. It may not be, and may never be, a 1:1 match, but "good enough" already ran the world before LLMs.

There is not even a formal definition of what intelligence is, so saying LLMs are intelligent can't even be "right/wrong". It's just arguing semantics and definitions.

They are better than humans at tasks that require information recall and application to a specific task.

For example, front end web app layout and basic functionality. Anyone can make a website with interactive buttons with ease now, whereas before, you had to go look up examples, try stuff, figure out why it's not working, etc.

But in terms of organization and higher level tasks, like for example making a front end that is clean, robust, easily extensible, and doesn't break, LLMs require almost as much prompting to do this as it takes to actually write the code.

They are mostly "faster" than the majority of humans. They are rarely better than experienced and talented humans at the majority of tasks they are able to do. They are better on both scales on a small thin slice of work tasks.
They're not better than the best humans at practically anything. However I doubt there's a person alive that could outperform an LLM on a broad suite of tasks like Humanity's Last Exam and the vast majority of people probably couldn't answer a single question on it.
They’re the language part of the puzzle, which seems to require some basic world modeling, but they can’t make novel models unless there’s an example in their training data.

I think engineering and mathematical thought requires spatial reasoning, when I model problems I see them as 3D shapes. Like the economy is a series of tubes that money flows through and collect in buckets, programming state is little boxes that hold values, chemical interactions are like keys that fit into locks.

I don’t think LLMs can build models like that, but because they have so much memorized and there usually isn’t a need for a novel model custom fit for a problem, they can fake it by imitation.

Seems like LLMs are that. A bunch of most probable word associations is a network, and you can build a physical model of a network, or build a network that allows you to reason about a physical model. Whether it's just a flowchart or workflow diagram, or an X-dimensional matrix with vectors moving through it.

But the only way to map the network in an LLM is experimentally. You have to prompt it, and see how the coefficients fall in order to construct your most likely walk through the training data.

I think that LLMs can and do come up with novel things through exhaustion, just by applying the relationships between some set of entities to entirely different sets of entities because an accumulation of earlier context pushed the probability of those entities being mentioned, and they were able to easily replace a selection of entities that were more associated with those nearer connective, relationship words.

I think that as such LLMs are good at generating metaphors, and a lot of innovation comes from going "What if As worked like Bs?" Just go through all the As and Bs, toss the ones that don't make any sense and test the ones that seem like they might.

I don't believe you can say that "LLM" is part of intelligence. No single human is exposed to as much text as any LLM model ingests, not by many orders of magnitude, and humans still perform cognition and generate new language.
Most LLMs are multimodal now, able to map visual concepts to language and vice versa. If OpenAI's recent Erdos solution was faking math, it faked it very well.
3D isn’t one of the modes though, I know a paper several years back showed that diffusion models don’t actually understand physics or geometry.

I can’t evaluate the Erdos solution personally, but both math and software have many problems that are some combination of other problems and since it can get instant verification feedback it can try millions of permutations to discover the right solution. This is valuable, I’m not dismissing it, but I think there’s another tier of harder problems that I don’t believe LLMs can solve and it will require some further theoretical breakthroughs to get there.

"How can we understand what an LLM is "thinking"? It's clearly very valuable to do so — it could enable steering model behavior, detecting dangerous intent, and more."

Well that is complete and utter bollocks, dribbled in para three or so, and obviously written by a next token guesser.

LLMs are tools and I'm pretty sure if I let you loose on some of my tools, you might lose an extremity unless I kept an eye on you.

I have an on prem Qwen3.6-35B-A3B-UD-Q4_K_XL working on a box in the office and it's quite handy for a chat.

>On the Biology of a Large Language Model

Biology? Anthropic really needs to stop anthropomorphizing these things so much. I'm with Dijkstra on this one.

I know they do it as a sort of marketing but still...

Why, what did Dijkstra have to say about biology vs computer systems?

I didn't see him mentioned in the article and I can't recall what he ever said about biology before..

He was strongly opposed to anthropomorphizing computing.
AI (assisted?) summary of the March 2025 Anthropic paper https://transformer-circuits.pub/2025/attribution-graphs/bio...
As the author - this was adapted from a thread posted on X in March (linked in article). AI did the adaptation, I wrote the original article. It seems like it inserted grammatically correct hyphens, otherwise the copy is mine.
It's nice to see sparse interpretable LLMs being made.

This is similar to factor rotation in factor analysis (or PCA). A varimax rotation, for example, can produce an equivalent factor analysis with sparse loadings, and which is generally more interpretable. Fortunately for us the world is not just a complete mess, and sparse loadings can often be found. There seem to be "natural" concepts that we have observed rather than invented.

(Many examples in other simple machine learning methods too, I am sure.)
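To make the varimax point concrete, here is a minimal sketch in Python, not taken from the article. It assumes only two factors, so the rotation is a single angle, and it brute-forces that angle on a grid instead of using the usual iterative algorithm. The loadings and the names `rotate`, `varimax_criterion`, and `varimax_2d` are made up for illustration.

```python
import math


def rotate(loadings, theta):
    """Rotate a two-factor loading matrix by angle theta (radians)."""
    c, s = math.cos(theta), math.sin(theta)
    return [(a * c - b * s, a * s + b * c) for a, b in loadings]


def varimax_criterion(loadings):
    """Sum over the two factors of the variance of the squared loadings."""
    n = len(loadings)
    total = 0.0
    for j in range(2):
        squared = [row[j] ** 2 for row in loadings]
        mean = sum(squared) / n
        total += sum((x - mean) ** 2 for x in squared) / n
    return total


def varimax_2d(loadings, steps=3600):
    """Grid-search the angle that maximizes the varimax criterion.

    The criterion repeats every 90 degrees (a quarter turn only swaps the
    factors and flips a sign), so searching [0, 90) degrees is enough.
    """
    angles = [k * (math.pi / 2) / steps for k in range(steps)]
    best = max(angles, key=lambda t: varimax_criterion(rotate(loadings, t)))
    return rotate(loadings, best), best


# Hypothetical data: each variable loads on exactly one of two factors.
sparse = [(0.9, 0.0), (0.6, 0.0), (0.3, 0.0),
          (0.0, 0.9), (0.0, 0.6), (0.0, 0.3)]

# An equivalent but harder-to-read solution: the same model rotated by 30 degrees.
dense = rotate(sparse, math.radians(30))

recovered, angle = varimax_2d(dense)
for row in recovered:
    print(tuple(round(x, 3) for x in row))
```

The dense and recovered loadings describe the same factor model (each row keeps the same sum of squares); only the rotated one has the near-zero entries that make it easy to read. The recovered factors can come back swapped or sign-flipped, which is the usual rotational ambiguity.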

"lack of metacognitive insight" is interesting, because I have observed people acting this way too. I have even observed it in myself.
A while back during a particularly rough patch where everything was going wrong, I started thinking, "man, I really hope I'm being stupid and doing it wrong..." (because then I can stop doing that!)

And wouldn't you know it, I keep getting my wish :)

I believe it's a great deal worse than that. All the metacognitive insight we do have may just be confabulation and we are fooled into believing that we have it because the process for conjuring it is good at finding a plausible answer.

When you read about and observe the split-brain patient experiments the appropriate response is abject horror at the implications.

I love the split-brain patient experiments!
It is a characteristic of neural nets that they do not have insight into their own functioning.

It is arguably a characteristic of any intelligent system, that at least some part of it must be opaque to itself, but the previous sentence is more defensible than a generalized claim.

If you don't understand what that means, tell me from your own metacognitive insight what parts of your brain are being used to read this. Not because of learned knowledge about what parts of the brain do what, through your own insight in your own functioning. You can't, because you don't have any.

This isn't just that humans rationalize a lot. This is below that. This is that even if you notice yourself rationalizing, which is something you can train yourself to do, you have no access to the underlying computations/processes of the rationalization itself, or the process of noticing you are rationalizing.

There is arguably still a sense that we experience in which we humans could reasonably say "No, I'm pretty sure I used addition-with-carry to answer you", so that is perhaps not the easiest example to think about the experience of. But there will always be some question of "how did you do that" to which you can give no answer because the answer is in the firing of the neural net itself and you, who is in one way or another the product of that firing, do not have access to that. How did you quickly catch that ball that someone unexpectedly threw at you? You just did, as far as your neural net is concerned.

(Also, while I've expressed this in terms of your conscious experience, this doesn't have anything to do with "consciousness". Neural nets in general do not get this feedback and do not and cannot have arbitrary metacognition about their own functioning. This is an artifact of my writing text to address conscious beings.)

Yep. An effect cannot in itself reason about its cause. Some effects can suggest causes though. For example, you tend to have memory of an algorithm you just executed, when the usage was at least somewhat conscious or intended, which can lead you to be able to guess which algorithm you used (and potentially even correctly).
I think that lack is overwhelmingly the norm in humans. I can't tell you which neurons fired in my brain when I add two small integers.
As trivial as that example is, it boggles my mind just how large the scale gets of things about me I do not fully understand or cannot explain. It feels different than losing track of what I just did, because a memory of it seemingly never existed to lose in the first place. For example, as much as I can try to reason about executive dysfunction, I cannot seem to understand the real actual equation that results in me being willing or not to do something. It just feels like my own brain disagrees with me, and that's so frustrating, and I've been trying to rationalize it for years but in the end I just do not know, and likely cannot know.

It's not even trivial to identify what it is exactly I'm not aware of. There's just some pattern I don't like, and the factors that influence it are a mystery. I've discovered some things over the years that seem to correlate with it, but nothing that truly explains or remedies it.

Isn't that a metacognitive insight itself?
It is (especially the self-reflection).
I am not incapable of metacognition, but I still have observed (and keep observing) cases where I seem to lack real metacognitive insight into something.
What if your theory about how you reason turns out to be different from how you actually reason?
Well, I'll probably never know if so. In any case, my theory of reasoning works well enough to serve its purpose, I'd think.
As far as I can tell, LLMs are approaching the mythical p-zombie.
Do the same principles apply to diffusion-based modeling?
Just curious: are there languages other than English that are better or more efficient to build LLMs with?
For some definitions of better, yes. Chinese is more token efficient for representing fixed text, for example, although this does not always lead to better performance on downstream tasks.
True. I suspect it's still hard to tell whether the bottleneck is the language itself, the tokenizer, or just the overwhelming amount of English training data.
> Ask it "what is the capital of the state containing Dallas" and you can observe, in order:

> the Dallas feature goes active,

> which causes the Texas feature to light up,

> which then causes Austin to light up.

> It seems fairly clear that this is tracing semantic relationships between high-level concepts — and in doing so, performing a kind of pseudo-symbolic inference, similar to what some philosophers would describe as "higher reasoning."

Uhhh no reasoning is required for Austin to follow Texas after Dallas, let alone "higher reasoning".

This is really grasping at straws.

Even better, I just tried in ChatGPT and it just googled it and told me. That’s not reasoning, that’s offloading a task and taking five times as long and way more energy than if I just typed it into Google myself.
Only some animals can perform a reasoning chain that long. Thus higher reasoning.
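For anyone following the Dallas -> Texas -> Austin disagreement above, a toy sketch in Python of what is being argued about. The dictionaries and function name are hypothetical stand-ins; a real model stores these associations as learned features, not lookup tables, and nothing here is the article's method.

```python
# Hypothetical toy facts, standing in for associations a model has learned.
CITY_TO_STATE = {"Dallas": "Texas", "Seattle": "Washington"}
STATE_TO_CAPITAL = {"Texas": "Austin", "Washington": "Olympia"}

# Hypothetical memorized shortcut: city straight to capital, no middle step.
SHORTCUT = {"Dallas": "Austin", "Seattle": "Olympia"}


def capital_of_state_containing(city):
    """Two-hop lookup that returns the intermediate state as well."""
    state = CITY_TO_STATE[city]        # hop 1: Dallas -> Texas
    capital = STATE_TO_CAPITAL[state]  # hop 2: Texas -> Austin
    return state, capital


trace = capital_of_state_containing("Dallas")
print(trace)               # two-hop path, intermediate step visible
print(SHORTCUT["Dallas"])  # one-hop path, same final answer
```

Both paths return Austin, so the output alone cannot tell them apart. The article's claim rests on observing an intermediate Texas feature, which is what the first function exposes; whether composing two lookups counts as reasoning is the actual argument here.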