|
|
|
|
|
by Philpax
447 days ago
|
|
> The only thing that survives from one iteration to the next is the singular highest ranking token, the entire state and "thought process" of the network cannot be represented by a single token, which means that every strategy is encoded in it during training, as a lossy representation of the training data. This is not true. The key-values of previous tokens encode computation that can be accessed by attention, as mentioned by colah3 here: https://news.ycombinator.com/item?id=43499819 You may find https://transformer-circuits.pub/2021/framework/index.html useful. |
|
The whitepaper you linked is a great one, I was all over it a few years back when we built our first models. It should be recommended reading for anyone interested in CS.