by PartiallyTyped 984 days ago
I mean ... if you think about it, attention changes the effective weights of a model.

I am fairly certain that if you try, you can show that for any particular sequence of N tokens, the first N-1 tokens induce a residual FFNN that produces exactly the same distribution over the next token when given just the Nth.
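To make the idea concrete, here is a rough sketch for the simplest case only: one softmax attention head with a residual connection, no MLP, no layer norm. It does not prove the claim for a full model. The names `induce_network`, `attention_last_token`, and `softmax_readout` are made up for this illustration. The context keys and values are computed once and then behave like fixed weights in a function of the Nth token alone, which is essentially what a key/value cache stores.

```python
import math
import random

def matvec(M, x):
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]

def softmax_readout(q, keys, vals):
    """Softmax-weighted average of vals, scored by q . k / sqrt(d)."""
    scale = math.sqrt(len(q))
    scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[c] for w, v in zip(weights, vals)) / total
            for c in range(len(vals[0]))]

def attention_last_token(X, Wq, Wk, Wv):
    """Residual single-head attention output at the last position of X."""
    q = matvec(Wq, X[-1])
    keys = [matvec(Wk, x) for x in X]
    vals = [matvec(Wv, x) for x in X]
    return [x + a for x, a in zip(X[-1], softmax_readout(q, keys, vals))]

def induce_network(context, Wq, Wk, Wv):
    """Bake the first N-1 tokens into fixed weights (hypothetical helper).

    Returns a function that takes only the Nth token.
    """
    K_ctx = [matvec(Wk, x) for x in context]  # plays the role of a first-layer weight matrix
    V_ctx = [matvec(Wv, x) for x in context]  # plays the role of a second-layer weight matrix

    def net(x):
        q = matvec(Wq, x)
        keys = K_ctx + [matvec(Wk, x)]  # the Nth token still attends to itself
        vals = V_ctx + [matvec(Wv, x)]
        return [xi + a for xi, a in zip(x, softmax_readout(q, keys, vals))]

    return net

rng = random.Random(0)
d, N = 4, 5
Wq, Wk, Wv = ([[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)] for _ in range(3))
X = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(N)]

net = induce_network(X[:-1], Wq, Wk, Wv)  # sees only the first N-1 tokens
full = attention_last_token(X, Wq, Wk, Wv)
induced = net(X[-1])                      # sees only the Nth token
max_diff = max(abs(a - b) for a, b in zip(full, induced))
print(max_diff < 1e-12)
```

The induced function is not a plain two-layer network, since the Nth token contributes its own key and value, but everything that depends on the context is frozen before the Nth token arrives.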

2 comments

You may be interested in "Linear Transformers Are Secretly Fast Weight Programmers": https://arxiv.org/abs/2102.11174
Seems very similar to "Language Models Implicitly Perform Gradient Descent as Meta-Optimizers"

https://arxiv.org/abs/2212.10559

Only superficially.

You should give the Fast Weight Programmers paper a chance, and a thorough reading. It sounds like you already appreciate a fair bit of its main point.

The best part about the FWP paper is the derivation of an FWP equation from the transformer equation. It's remarkably straightforward. You remove the softmax operation (i.e. linearize) and the rest is just algebraic manipulation -- a formal proof.
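A minimal sketch of that equivalence, under simplifying assumptions that are mine and not the paper's exact formulation: a single head, causal masking, an identity feature map in place of the paper's kernel feature map, and no normalization term. The function names are hypothetical. With the softmax gone, the attention sum factors into a matrix W that is updated by one outer product per token and then read with the query.

```python
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_linear_attention(Q, K, V):
    """Attention with the softmax removed: y_t = sum_{i<=t} (k_i . q_t) * v_i."""
    out = []
    for t, q in enumerate(Q):
        y = [0.0] * len(V[0])
        for i in range(t + 1):
            score = dot(K[i], q)
            y = [acc + score * v for acc, v in zip(y, V[i])]
        out.append(y)
    return out

def fast_weight_programmer(Q, K, V):
    """Same outputs via a recurrent fast-weight matrix W_t = W_{t-1} + v_t k_t^T."""
    d_v, d_k = len(V[0]), len(K[0])
    W = [[0.0] * d_k for _ in range(d_v)]
    out = []
    for q, k, v in zip(Q, K, V):
        for r in range(d_v):
            for c in range(d_k):
                W[r][c] += v[r] * k[c]          # write: rank-1 outer-product update
        out.append([dot(row, q) for row in W])  # read: y_t = W_t q_t
    return out

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
Q, K, V = (random_matrix(6, 4, rng) for _ in range(3))
attn_out = causal_linear_attention(Q, K, V)
fwp_out = fast_weight_programmer(Q, K, V)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(attn_out, fwp_out)
               for a, b in zip(row_a, row_b))
print(max_diff < 1e-9)
```

The two functions agree up to floating-point rounding because (sum_i v_i k_i^T) q_t = sum_i v_i (k_i . q_t); the only difference is whether you store all past keys and values or fold them into W as you go.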

Transformers are just NNs that learn to control a Content-Addressable Memory (CAM).

This perspective has far-reaching implications for ML, much as category theory had for metamathematics and type theory. For example, an LSTM cell can be viewed as an NN that learns to control a flip-flop (the "deluxe" kind found on FPGAs, with output-enable, clock-enable, and reset inputs). I've found that this is by far the easiest way to explain LSTMs to people. It also raises the obvious question of what other kinds of simple blocks can be controlled by NNs. I think this question will lead to another wave of breakthroughs.
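A toy sketch of the flip-flop analogy, using the standard LSTM state update with scalar state and the gate values passed in by hand rather than computed by a network. The mapping in the comments (input gate as clock-enable, forget gate as inverted reset, output gate as output-enable) is my reading of the analogy, not something spelled out above, and `lstm_cell_step` is a hypothetical helper.

```python
import math

def lstm_cell_step(c_prev, candidate, input_gate, forget_gate, output_gate):
    """One scalar LSTM state update with gate activations supplied directly."""
    c = forget_gate * c_prev + input_gate * candidate  # keep old state and/or load new value
    h = output_gate * math.tanh(c)                     # expose the state only if enabled
    return c, h

c = 0.0

# Load: clock-enable on (input_gate=1), old state dropped, output enabled.
c, h = lstm_cell_step(c, candidate=0.8, input_gate=1.0, forget_gate=0.0, output_gate=1.0)
write_c, write_h = c, h

# Hold: clock-enable off, state kept (forget_gate=1), output disabled.
c, h = lstm_cell_step(c, candidate=-0.5, input_gate=0.0, forget_gate=1.0, output_gate=0.0)
hold_c, hold_h = c, h

# Reset: state cleared (forget_gate=0) with nothing loaded.
c, h = lstm_cell_step(c, candidate=-0.5, input_gate=0.0, forget_gate=0.0, output_gate=1.0)
reset_c, reset_h = c, h

print(write_c, hold_c, hold_h, reset_c)
```

In a real LSTM the gates are sigmoid outputs of learned layers, so they take values between 0 and 1 and the cell behaves like a soft, differentiable version of this register.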

The ultimate limit of this approach is the Gödel Machine -- although no attempt to build one has come anywhere close to success yet.

Sounds interesting, try it and share your results here :)