by killerstorm 1187 days ago

> The mechanism is entirely concerned with retaining semantic context and semantic "global dependencies" spanning the entire input and output.

This is not quite true: GPT, specifically, is auto-regressive. It computes things only looking back, not forward.
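To make the "only looking back" point concrete, here is a toy sketch of causal self-attention in plain Python. It is not GPT's actual implementation: the function name `causal_attention` is made up for illustration, there is a single head, and I assume queries, keys and values are just the input vectors themselves (no learned projections). It only shows that position i can read positions 0..i and nothing later.

```python
import math
import random

def causal_attention(x):
    """Toy single-head self-attention with a causal mask.

    Assumption for illustration: queries, keys and values are the
    input vectors themselves (no learned weight matrices).
    """
    n, d = len(x), len(x[0])
    out = []
    for i in range(n):
        # Position i may only attend to positions 0..i (looking back).
        scores = [
            sum(a * b for a, b in zip(x[i], x[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Output is the weighted mix of the visible value vectors.
        out.append(
            [sum(weights[j] * x[j][k] for j in range(i + 1)) for k in range(d)]
        )
    return out

random.seed(0)
tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(5)]
base = causal_attention(tokens)

# Replace the LAST token: outputs at earlier positions must not change.
changed_last = [row[:] for row in tokens]
changed_last[-1] = [random.gauss(0, 1) for _ in range(4)]
after_last = causal_attention(changed_last)
earlier_unchanged = all(base[i] == after_last[i] for i in range(4))

# Replace the FIRST token: the output at the last position does change.
changed_first = [row[:] for row in tokens]
changed_first[0] = [random.gauss(0, 1) for _ in range(4)]
after_first = causal_attention(changed_first)
later_changed = base[-1] != after_first[-1]
```

So information flows one way only: whatever a later token needs from an earlier one has to already be sitting in that earlier position's representation.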

Given that each token has only a fixed computing budget, it is likely that GPT precomputes information which will be relevant to later tokens, to be routed via attention.

In fact, this effect was demonstrated in practice: e.g. in a prompt like "Question: Where is the Eiffel tower located? Answer: " people found that information about "Paris" is routed from the "Eiffel tower" tokens, i.e. the associative memory was looked up earlier than it was needed.

So I was answering from that perspective: it can do better if it knows what to pre-compute.