
by lsy 454 days ago

Thanks for commenting, I like the example because it's simple enough to discuss. Isn't it more accurate to say not that Claude "realizes it's going to say astronomer" or "knows that it's going to say something that starts with a vowel" and more that the next token (or more pedantically, vector which gets reduced down to a token) is generated based on activations that correlate to the "astronomer" token, which is correlated to the "an" token, causing that to also be a more likely output?

I kind of see why it's easy to describe it colloquially as "planning" but it isn't really going ahead and then backtracking, it's almost indistinguishable from the computation that happens when the prompt is "What is the indefinite article to describe 'astronomer'?", i.e. the activation "astronomer" is already baked in by the prompt "someone who studies the stars", albeit at one level of indirection.

The distinction feels important to me because I think for most readers (based on other comments) the concept of "planning" seems to imply the discovery of some capacity for higher-order logical reasoning which is maybe overstating what happens here.
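To make the "one level of indirection" reading above concrete, here is a toy sketch. It is not how Claude or any real model works: the logits, the concept names, and the weights are all made-up assumptions. It only shows that an upstream "astronomer" activation can raise the probability of "an" in a single forward computation, whether that activation comes from a direct mention or from a description.

```python
import math

# Hypothetical baseline logits for the next token when no concept is active.
BASE_LOGITS = {"a": 2.0, "an": 1.0, "the": 0.5}

# Hypothetical weights: how strongly an active concept pushes each article.
CONCEPT_TO_ARTICLE = {
    "astronomer": {"an": 3.0},
    "physicist": {"a": 3.0},
}


def softmax(logits):
    """Turn a dict of logits into a dict of probabilities."""
    m = max(logits.values())
    exps = {tok: math.exp(v - m) for tok, v in logits.items()}
    z = sum(exps.values())
    return {tok: e / z for tok, e in exps.items()}


def article_distribution(concept_activations):
    """Add each active concept's contribution to the article logits."""
    logits = dict(BASE_LOGITS)
    for concept, activation in concept_activations.items():
        for token, weight in CONCEPT_TO_ARTICLE.get(concept, {}).items():
            logits[token] += activation * weight
    return softmax(logits)


# Assumed activation levels: a description ("someone who studies the stars")
# activates the concept a bit less than naming it directly.
no_concept = article_distribution({})
indirect = article_distribution({"astronomer": 0.8})
direct = article_distribution({"astronomer": 1.0})

for name, dist in [("none", no_concept), ("indirect", indirect), ("direct", direct)]:
    print(name, max(dist, key=dist.get), round(dist["an"], 3))
```

In this toy, nothing looks ahead or backtracks; the article follows from whatever concept activation is already present.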


cgdl 454 days ago

Thank you. In my mind, "planning" doesn’t necessarily imply higher-order reasoning but rather some form of search, ideally with backtracking. Of course, architecturally, we know that can’t happen during inference. Your example of the indefinite article is a great illustration of how this illusion of planning might occur. I wonder if anyone at Anthropic could compare the two cases (some sort of minimal/differential analysis) and share their insights.


colah3 454 days ago

I used the astronomer example earlier as the most simple, minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper:

https://transformer-circuits.pub/2025/attribution-graphs/bio...

There are several interesting properties:

- Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics)

- Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word)

- Holding many competing/alternative candidates in parallel.

- Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them".

With that said, I think it's easy for these arguments to fall into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions!


pas 453 days ago

Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)

I'm curious where the state for this "planning" is stored. In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?

So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very, very non-specific tokens, it's likely that the "semantic" state really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line, to avoid strange non-aesthetic repetitions but attract cool/funky (aesthetic) semantic ones (like "hare" or "bunny"), and so on, right?)

All of this is baked in during training. At inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot), and even though there's a "loop", there's no algorithm to generate top N lines and pick the best (no working memory shuffling).

So if it's planning it's preplanned, right?
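The contrast drawn above, between a fixed token-by-token loop and an explicit "generate top N lines and pick the best" procedure, can be sketched with a toy. Everything here is an assumption for illustration: the probability table is invented, and it conditions only on the previous token, whereas a real model conditions on the whole context.

```python
# Hypothetical next-token probabilities, keyed by the previous token only.
TABLE = {
    "<s>": {"like": 0.55, "a": 0.45},
    "like": {"rabbit": 0.6, "habit": 0.4},
    "a": {"habit": 0.9, "rabbit": 0.1},
    "rabbit": {"</s>": 1.0},
    "habit": {"</s>": 1.0},
}


def greedy_decode(start="<s>", max_len=5):
    """Token-by-token loop: always take the most likely successor.

    No candidate lines are enumerated and nothing is revisited.
    """
    seq = [start]
    while seq[-1] != "</s>" and len(seq) < max_len:
        options = TABLE[seq[-1]]
        seq.append(max(options, key=options.get))
    return seq


def best_of_all_lines(start="<s>"):
    """Explicit search: enumerate every complete line, score it, keep the best."""

    def expand(seq, score):
        if seq[-1] == "</s>":
            yield seq, score
            return
        for token, prob in TABLE[seq[-1]].items():
            yield from expand(seq + [token], score * prob)

    return max(expand([start], 1.0), key=lambda pair: pair[1])


print(greedy_decode())        # same fixed table in, same tokens out, every run
print(best_of_all_lines())    # a different procedure, which can pick a different line
```

The greedy loop is the one that corresponds to "no working memory shuffling"; the second function is the kind of search the comment says is absent at the sampling level. Whether something search-like happens inside a single forward pass is a separate question that this toy does not address.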


colah3 453 days ago

The planning is certainly performed by circuits which were learned during training.

I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing is stitching together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than that there was something similar in the training data.

This is all very speculative, but:

- At the forward planning step, generating the candidate words seems like it's an intersection of the semantics and rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily be pieced together from examples that independently build the pathway for the semantics and the pathway for the rhyming scheme.

- At the backward chaining step, many of the features for constructing sentence fragments seem to have quite general targets (perhaps animals in one case, while others might even just be nouns).
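As a loose illustration of the two speculative points above, the following toy treats candidate generation as a set intersection of two independent pathways, and "writing towards" a candidate as picking a fragment template keyed on a general category. The word sets, categories, and templates are all hypothetical and are not taken from the paper; this is a sketch of the idea, not of the model's actual mechanism.

```python
# Hypothetical outputs of two independently learned pathways.
RHYMES_WITH_PREVIOUS_LINE = {"rabbit", "habit", "orbit"}
FITS_THE_SEMANTICS = {"rabbit", "hare", "bunny", "habit"}

# Hypothetical general categories that a fragment-building feature might target.
CATEGORY = {"rabbit": "animal", "habit": "noun"}

# Hypothetical sentence fragments, keyed by category rather than by specific word.
TEMPLATES = {
    "animal": "like a hungry {}",
    "noun": "a powerful {}",
}


def forward_candidates(rhyme_set, semantic_set):
    """Candidates are words that satisfy both constraints at once."""
    return sorted(rhyme_set & semantic_set)


def write_towards(target):
    """Pick a fragment for the target's general category, ending on the target."""
    return TEMPLATES[CATEGORY[target]].format(target)


candidates = forward_candidates(RHYMES_WITH_PREVIOUS_LINE, FITS_THE_SEMANTICS)
for word in candidates:
    print(word, "->", write_towards(word))
```

Neither pathway needs to have seen the other's constraint: the intersection falls out of combining them, and the templates only need to know the target's category.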


cgdl 453 days ago

Thank you, this makes sense. I am thinking of this as an abstraction/refinement process where an abstract notion of the longer completion is refined into a cogent whole that satisfies the notion of a good completion. I look forward to reading your paper to understand the "backward chaining" aspect and the evidence for it.


miraculixx 450 days ago

To plan: to think about and decide what you are going to do or how you are going to do something (Cambridge Dictionary)

That implies higher-order reasoning. If the model does not do that, which it doesn't, that's quite simply the wrong term.
