by ludwigschubert
1643 days ago
Still reading, but an early highlight for me is their rewriting of the attention mechanism in terms of tensor products (section “Attention Heads as Information Movement”). I always found the original formulation in terms of query/key/value/output matrices quite hard to make operational sense of. It’s extremely exciting to see more “mechanistic interpretability”/“circuit analysis”-style work on ML architectures that are not convolutional neural networks!