| HN Mirror


by colah3 447 days ago

I used the astronomer example earlier as the most simple, minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper:

https://transformer-circuits.pub/2025/attribution-graphs/bio...

There are several interesting properties:

- Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics)

- Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word)

- Holding many competing/alternative candidates in parallel.

- Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them".
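To make the four properties above concrete, here is a deliberately tiny toy sketch. It is not the model's mechanism and not code from the paper: the word sets, the lead-in fragments, and the function names are all made up for illustration, with plain Python sets standing in for what would be learned features.

```python
# Toy illustration only: hand-written sets stand in for learned features.
RHYME_OK = {"rabbit", "habit", "inhabit"}             # hypothetical rhyme pathway
CONTEXT_OK = {"rabbit", "habit", "carrot", "hunger"}  # hypothetical semantic pathway

def forward_candidates(rhyme_set, context_set):
    """'Forward search': words that could end the next line under both constraints."""
    return sorted(rhyme_set & context_set)

# Invented lead-in fragments that "write towards" each candidate.
LEAD_INS = {
    "rabbit": "His hunger was like a starving",
    "habit": "His hunger was a powerful",
}

def write_towards(target):
    """'Backward chaining': pick the earlier words so the line can end on the target."""
    return f"{LEAD_INS[target]} {target}"

# Several candidates are held side by side before any one is committed to.
candidates = forward_candidates(RHYME_OK, CONTEXT_OK)
lines = [write_towards(word) for word in candidates]
```

The only point is the shape: candidates are chosen first as abstract targets, and the words before them are picked afterwards to land on one.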

With that said, I think it's easy for these arguments to fall into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions!

2 comments

pas 447 days ago

Thanks for linking to this semi-interactive thing, but ... it's completely incomprehensible. :o (edit: okay, after reading about CLT it's a bit less alien.)

I'm curious where the state is stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?

So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very, very non-specific tokens, it's likely that the "semantic" state really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line too, to avoid strange, non-aesthetic semantic repetitions but attract cool/funky (aesthetic) ones (like "hare" or "bunny"), and so on, right?)

All of this is baked in during training; at inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot), and even though there's a "loop" there's no algorithm to generate top N lines and pick the best (no working memory shuffling).

So if it's planning it's preplanned, right?


colah3 447 days ago

The planning is certainly performed by circuits which were learned during training.

I'd expect that, just like in the multi-step planning example, there are lots of places where the attribution graph we're observing is stitching together lots of circuits, such that it's better understood as a kind of "recombination" of fragments learned from many examples, rather than as a replay of something similar in the training data.

This is all very speculative, but:

- At the forward planning step, generating the candidate words seems like it's an intersection of the semantics and rhyming scheme. The model wouldn't need to have seen that intersection before -- the mechanism could easily be pieced together from examples that independently build the pathway for the semantics and the pathway for the rhyming scheme.

- At the backward chaining step, many of the features for constructing sentence fragments seem to have quite general targets (perhaps animals in one case; others might even just be nouns).
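As a rough picture of why a general target matters for the backward chaining point, here is a toy sketch. Everything in it is invented for illustration (the category table, the templates, and the `fragment_for` helper) and it says nothing about how the real features are implemented. It only shows that a fragment keyed on a broad category can serve a target word it was never explicitly paired with.

```python
# Toy sketch; the categories and templates are invented for illustration.
CATEGORY = {"rabbit": "animal", "hare": "animal", "habit": "noun"}

TEMPLATE_BY_CATEGORY = {
    "animal": "was like a starving {target}",
    "noun": "was a powerful {target}",
}

def fragment_for(target):
    # The template is selected by the target's general category, not by the
    # specific word, so any word in the category can reuse it.
    return TEMPLATE_BY_CATEGORY[CATEGORY[target]].format(target=target)

# "hare" has no template of its own; it reuses the "animal" one.
hare_fragment = fragment_for("hare")
```

Under that picture, pieces built from many different examples could recombine for a new line without that exact line ever having been seen.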


cgdl 446 days ago

Thank you, this makes sense. I am thinking of this as an abstraction/refinement process where an abstract notion of the longer completion is refined into a cogent whole that satisfies the notion of a good completion. I look forward to reading your paper to understand the "backward chaining" aspect and the evidence for it.
