by ludwigschubert 1643 days ago
Still reading, but an early highlight for me is their rewriting of the attention mechanism in terms of tensor products (section “Attention Heads as Information Movement”). I always found the original formulation in terms of query/key/value/output matrices quite hard to make operational sense of.
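
For anyone who wants to poke at the equivalence, here is a minimal sketch of how I read it, not the authors' code. It assumes a single head, no biases, no causal mask, and tokens stored as rows of x. All names (head_qkvo, head_tensor, w_ov, the toy sizes) are mine, and it is plain Python so it runs without any libraries. It computes one head's output the usual query/key/value/output way, then again as (A ⊗ W_O W_V) applied to the flattened input, and compares the two.

```python
import math
import random


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def kron(a, b):
    """Kronecker product: result[i*p + k][j*q + l] = a[i][j] * b[k][l]."""
    return [[a_ij * b_kl for a_ij in a_row for b_kl in b_row]
            for a_row in a for b_row in b]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def head_qkvo(x, w_q, w_k, w_v, w_o):
    """Usual formulation: project to q/k/v, mix values by attention, project out."""
    d_head = len(w_q)
    q = matmul(x, transpose(w_q))          # (n_tokens, d_head)
    k = matmul(x, transpose(w_k))          # (n_tokens, d_head)
    v = matmul(x, transpose(w_v))          # (n_tokens, d_head)
    scores = [[s / math.sqrt(d_head) for s in row]
              for row in matmul(q, transpose(k))]
    attn = softmax_rows(scores)            # (n_tokens, n_tokens)
    mixed = matmul(attn, v)                # (n_tokens, d_head)
    return matmul(mixed, transpose(w_o)), attn


def head_tensor(x, attn, w_o, w_v):
    """Tensor-product formulation: (attn ⊗ W_O W_V) acting on the flattened x."""
    d_model = len(x[0])
    w_ov = matmul(w_o, w_v)                # (d_model, d_model)
    big = kron(attn, w_ov)                 # (n_tokens*d_model, n_tokens*d_model)
    flat = [[v] for row in x for v in row] # token-major column vector
    out = matmul(big, flat)
    return [[out[i * d_model + c][0] for c in range(d_model)]
            for i in range(len(x))]


# Toy sizes, chosen arbitrarily for illustration.
rng = random.Random(0)
n_tokens, d_model, d_head = 4, 6, 3
x = rand_matrix(n_tokens, d_model, rng)
w_q = rand_matrix(d_head, d_model, rng)
w_k = rand_matrix(d_head, d_model, rng)
w_v = rand_matrix(d_head, d_model, rng)
w_o = rand_matrix(d_model, d_head, rng)

h_qkvo, attn = head_qkvo(x, w_q, w_k, w_v, w_o)
h_tensor = head_tensor(x, attn, w_o, w_v)

max_diff = max(abs(p - q)
               for row_a, row_b in zip(h_qkvo, h_tensor)
               for p, q in zip(row_a, row_b))
print("max abs difference:", max_diff)
```

What made it click for me: attn only ever touches the token axis (which position reads from which), and the product W_O W_V only ever touches the feature axis (what gets moved). The Kronecker product just says those two things happen independently.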

It’s extremely exciting to see more “mechanistic interpretability”/“circuit analysis”-style work on ML architectures that are not convolutional neural networks!