| I used the astronomer example earlier as the most simple, minimal version of something you might think of as a kind of microscopic form of "planning", but I think that at this point in the conversation, it's probably helpful to switch to the poetry example in our paper: https://transformer-circuits.pub/2025/attribution-graphs/bio... There are several interesting properties: - Something you might characterize as "forward search" (generating candidates for the word at the end of the next line, given rhyming scheme and semantics) - Representing those candidates in an abstract way (the features active are general features for those words, not "motor features" for just saying that word) - Holding many competing/alternative candidates in parallel. - Something you might characterize as "backward chaining", where you work backwards from these candidates to "write towards them". With that said, I think it's easy for these arguments to fall into philosophical arguments about what things like "planning" mean. As long as we agree on what is going on mechanistically, I'm honestly pretty indifferent to what we call it. I spoke to a wide range of colleagues, including at other institutions, and there was pretty widespread agreement that "planning" was the most natural language. But I'm open to other suggestions! |
I'm curious where is the state stored for this "planning". In a previous comment user lsy wrote "the activation >astronomer< is already baked in by the prompt", and it seems to me that when the model generates "like" (for rabbit) or "a" (for habit) those tokens already encode a high probability for what's coming after them, right?
So each token is shaping the probabilities for the successor ones. So that "like" or "a" has to be one that sustains the high activation of the "causal" feature, and so on, until the end of the line. Since both "like" and "a" are very very non-specific tokens it's likely that the "semantic" state is really resides in the preceding line, but of course gets smeared (?) over all the necessary tokens. (And that means beyond the end of the line, to avoid strange non-aesthetic but attract cool/funky (aesthetic) semantic repetitions (like "hare" or "bunny"), and so on, right?)
All of this is baked in during training, during inference time the same tokens activate the same successor tokens (not counting GPU/TPU scheduling randomness and whatnot) and even though there's a "loop" there's no algorithm to generate top N lines and pick the best (no working memory shuffling).
So if it's planning it's preplanned, right?