|
|
|
|
|
by killerstorm
1187 days ago
|
|
> The mechanism is entirely concerned with retaining semantic context and semantic "global dependencies" spanning the entire input and output. This is not quite true: GPT, specifically, is auto-regressive. It computes things only looking back, not forward. Given that each token has only a fixed computing budget, it is likely that GPT precomputes information which will be relevant to later tokens, to be routed via attention. In fact, this effect was demonstrated in practice: e.g. in a prompt like "Question: Where is the Eiffel tower located? Answer: " people found that information about "Paris" is routed from tokens "Eiffel tower", i.e. this associative memory was looked up earlier than it was needed. So I was answering from that perspective: it can do better if it knows what to pre-compute. |
|