|
|
|
|
|
by jgehring
794 days ago
|
|
That's what happens in the very last layer. But at that point the embedding for "was" got enriched multiple times, i.e., in each attention pass, with information from the whole context (which is the whole novel here). So for the example, it would contain the information to predict, let's say, the first token of the first name of the murderer. Expanding on that, you could imagine that the intent of the sentence to complete (figuring out the murderer) would have to be captured in the first attention passes so that other layers would then be able to integrate more and more context in order to extract that information from the whole context. Also, it means that the forward passes for previous tokens need to have extracted enough salient high-level information already since you don't re-compute all attention passes for all tokens for each next token to predict. |
|
You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.
Edit: Or maybe you just meant you don't compute attention Σ(n) times for each new token?