by versteegen
777 days ago
The internal state at layer M of token N is available to every later token (position > N) at every higher layer (> M) via the attention heads. It is transformed by a matrix first, but it is still a very direct lookup mechanism. The state after the final attention layer is not addressable in this way, but it immediately becomes the output token, which is of course accessible. Note also that sequential computations such as loops translate nicely to parallel ones: for example, k layers can search the paths of length k in a graph if each token represents one node. But since each token can only look backwards, unless you're searching a DAG you'd also have to feed in the graph multiple times so that the nodes can all see each other. Hmm... that might be a useful LLM prompting technique.
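To make the graph example concrete, here is a toy sketch in plain Python. It is not a real transformer: each "layer" simply lets a token merge in the previous-layer state of earlier tokens that are its graph predecessors (or earlier copies of the same node). The function name `causal_layers`, the set-based state, and the rule for matching copies are all assumptions made for illustration. It also assumes a DAG would be listed in topological order, so that every edge points forward in the prompt.

```python
def causal_layers(tokens, edges, num_layers):
    """Toy model of backward-only lookups (hypothetical, for illustration).

    tokens: node labels in prompt order (a node may appear more than once).
    edges: set of directed (src, dst) pairs.
    Returns, for each position, the set of nodes found to have a path of
    length <= num_layers ending at that position's node.
    """
    state = [{node} for node in tokens]
    for _ in range(num_layers):
        new_state = []
        for i, node in enumerate(tokens):
            merged = set(state[i])
            for j in range(i):  # causal mask: only earlier positions are visible
                same_node = tokens[j] == node
                predecessor = (tokens[j], node) in edges
                if same_node or predecessor:
                    merged |= state[j]  # read the previous layer's state
            new_state.append(merged)
        state = new_state
    return state


# A 3-cycle: a -> b -> c -> a. The edge c -> a points backwards in the prompt.
cycle = {("a", "b"), ("b", "c"), ("c", "a")}

single = causal_layers(["a", "b", "c"], cycle, num_layers=2)
doubled = causal_layers(["a", "b", "c"] * 2, cycle, num_layers=2)

print(single[0])   # state of the only "a" token
print(doubled[3])  # state of the second "a" token
```

With the graph listed once, the `a` token never learns that `c` and `b` lead to it, because `c` comes later in the prompt. With the graph listed twice, the second `a` can look back at the first `c` and pick up both. In this toy model, graphs with several backward edges along one path may need more than two repetitions.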