Only superficially.

You should give the Fast Weight Programmers paper a chance, and a thorough reading. It sounds like you already appreciate a fair bit of its main point.

The best part about the FWP paper is the derivation of an FWP equation from the transformer equation. It's remarkably straightforward. You remove the softmax operation (i.e. linearize) and the rest is just algebraic manipulation -- a formal proof.
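To make that concrete, here is a minimal sketch in plain Python. It is my own illustration, not code from the paper, and it assumes the crudest possible linearization: the softmax is simply dropped, with no feature map and no normalization (the paper handles both more carefully). The function names and the toy integer inputs are made up for the example.

```python
def attention_without_softmax(keys, values, queries):
    """Causal attention with the softmax removed: y_t = sum_{i<=t} v_i * (k_i . q_t)."""
    outputs = []
    for t, q in enumerate(queries):
        y = [0] * len(values[0])
        for k, v in zip(keys[: t + 1], values[: t + 1]):
            score = sum(ki * qi for ki, qi in zip(k, q))  # unnormalized dot-product score
            y = [yj + score * vj for yj, vj in zip(y, v)]
        outputs.append(y)
    return outputs


def fast_weight_programmer(keys, values, queries):
    """The same computation as a recurrence: W_t = W_{t-1} + v_t k_t^T, then y_t = W_t q_t."""
    d_v, d_k = len(values[0]), len(keys[0])
    W = [[0] * d_k for _ in range(d_v)]  # fast weight matrix, starts empty
    outputs = []
    for k, v, q in zip(keys, values, queries):
        # "Program" the fast weights by adding the outer product v k^T.
        for a in range(d_v):
            for b in range(d_k):
                W[a][b] += v[a] * k[b]
        # Read out by applying the fast weights to the query.
        outputs.append([sum(W[a][b] * q[b] for b in range(d_k)) for a in range(d_v)])
    return outputs


# Toy inputs (hypothetical, integers so the comparison is exact).
keys = [[1, 0], [0, 1], [1, 1]]
values = [[2, 3, 5], [7, 1, 0], [1, 1, 1]]
queries = [[1, 2], [3, 1], [0, 1]]

attn_out = attention_without_softmax(keys, values, queries)
fwp_out = fast_weight_programmer(keys, values, queries)
assert attn_out == fwp_out
```

Both functions produce identical outputs. The second one never looks back at earlier keys and values; everything it needs is already folded into W.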

Transformers are just NNs that learn to control a Content-Addressable Memory (CAM).
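A rough sketch of what "content-addressable" means here, again in plain Python and again only my illustration: a read that weights every stored value by how well its key matches the query. The `sharpness` parameter is a hypothetical knob I added to show the limit in which the soft read becomes an ordinary exact-match lookup.

```python
import math


def cam_read(keys, values, query, sharpness=1.0):
    """Soft content-addressable read: weight each value by softmax(sharpness * k . q)."""
    scores = [sharpness * sum(ki * qi for ki, qi in zip(k, query)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total for j in range(dim)]


# Two stored slots (hypothetical toy data).
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0], [20.0]]

soft = cam_read(keys, values, [1.0, 0.0])                   # a blend of both slots
hard = cam_read(keys, values, [1.0, 0.0], sharpness=50.0)   # very nearly slot 0 alone
```

In a transformer the keys, values, and queries are all produced by learned layers, which is the "NN learns to control the memory" part; this sketch only shows the memory.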

This perspective has far-reaching implications for ML, much as category theory did for metamathematics and type theory. For example, an LSTM cell can be viewed as an NN that learns to control a flip-flop (the "deluxe" kind found on FPGAs, with output-enable, clock-enable, and reset inputs). I've found that this is by far the easiest way to explain LSTMs to people. It also raises the obvious question of what other kinds of simple blocks can be controlled by NNs. I think this question will lead to another wave of breakthroughs.
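Here is a small sketch of the flip-flop reading of an LSTM, using the standard cell update. The mapping in the comments (input gate as clock-enable, forget gate as an inverted reset, output gate as output-enable) is my own interpretation of the analogy, and the gates are passed in by hand as 0 or 1. In a real LSTM they are sigmoid outputs computed by the network from the current input and previous hidden state.

```python
import math


def lstm_cell_step(c_prev, candidate, input_gate, forget_gate, output_gate):
    """One LSTM state update with the gate values supplied directly.

    Assumed analogy: input_gate ~ clock-enable, forget_gate ~ NOT reset,
    output_gate ~ output-enable.
    """
    c = forget_gate * c_prev + input_gate * candidate  # stored state
    h = output_gate * math.tanh(c)                     # visible output
    return c, h


# Load a value: clock-enable on, old state discarded.
loaded = lstm_cell_step(0.0, 0.8, input_gate=1, forget_gate=0, output_gate=1)
# Hold: clock-enable off, so the new candidate (-0.5) is ignored.
held = lstm_cell_step(loaded[0], -0.5, input_gate=0, forget_gate=1, output_gate=1)
# Output disabled: the state is kept but nothing is driven out.
muted = lstm_cell_step(held[0], -0.5, input_gate=0, forget_gate=1, output_gate=0)
# Reset: forget gate at 0 clears the stored state.
cleared = lstm_cell_step(muted[0], -0.5, input_gate=0, forget_gate=0, output_gate=1)
```

The analogy is not exact: with the input and forget gates both open, the cell accumulates rather than overwrites, and the learned gates take values anywhere between 0 and 1 rather than being strictly on or off.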

The ultimate limit of this approach is the Gödel Machine -- although no attempt to build one has come anywhere close to success yet.