|
|
|
|
|
by colah3
449 days ago
|
|
> The obvious way to deal with this would be to send forward some of the internal activations as well as the generated words in the autoregressive chain. Hi! I lead interpretability research at Anthropic. That's a great intuition, and in fact the transformer architecture actually does exactly what you suggest! Activations from earlier time steps are sent forward to later time steps via attention. (This is another thing that's lost in the "models just predict the next word" framing.) This actually has interesting practical implications -- for example, in some sense, it's the deep reason costs can sometimes be reduced via "prompt caching". |
|