Hacker News new | ask | show | jobs
by Philpax 447 days ago
> The only thing that survives from one iteration to the next is the singular highest ranking token, the entire state and "thought process" of the network cannot be represented by a single token, which means that every strategy is encoded in it during training, as a lossy representation of the training data.

This is not true. The key-values of previous tokens encode computation that can be accessed by attention, as mentioned by colah3 here: https://news.ycombinator.com/item?id=43499819

You may find https://transformer-circuits.pub/2021/framework/index.html useful.

1 comments

This is a optimization to prevent redundant calculations. If it was not performed the result would be the same, just served slightly slower.

The whitepaper you linked is a great one, I was all over it a few years back when we built our first models. It should be recommended reading for anyone interested in CS.