by causal
796 days ago
> you don't re-compute all attention passes for all tokens for each next token to predict.

You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.

Edit: Or maybe you just meant you don't compute attention Σ(n) times for each new token?
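
One way to check the quoted claim, as a rough sketch rather than anything authoritative: assuming a decoder-style model with a causal mask (an assumption, the thread doesn't say which architecture is meant), position i only attends to positions 0..i. Appending token n+1 would then leave the attention outputs for the first n positions unchanged, and only the new row has to be computed against the cached keys and values. The helper names (`attend_row`, `causal_attention`) and the random q/k/v vectors are made up for illustration; a real model would derive them from learned projections.

```python
import math
import random


def attend_row(q_i, keys, values):
    """Softmax attention of one query vector over the given keys/values."""
    d = len(q_i)
    scores = [sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) for k_j in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w / total * v_j[c] for w, v_j in zip(weights, values)) for c in range(d)]


def causal_attention(q, k, v):
    """Full causal pass: row i only sees keys/values at positions 0..i."""
    return [attend_row(q[i], k[: i + 1], v[: i + 1]) for i in range(len(q))]


def random_vectors(count, dim, rng):
    # Hypothetical stand-in for projected token embeddings.
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
n, d = 5, 4
q = random_vectors(n + 1, d, rng)
k = random_vectors(n + 1, d, rng)
v = random_vectors(n + 1, d, rng)

out_n = causal_attention(q[:n], k[:n], v[:n])  # sequence of n tokens
out_n1 = causal_attention(q, k, v)             # same sequence plus one token

# Rows for the first n tokens are unaffected by the appended token.
prefix_unchanged = out_n1[:n] == out_n

# Incremental step: only the new query is attended against the cached k/v.
new_row = attend_row(q[n], k, v)
incremental_matches = new_row == out_n1[n]

print(prefix_unchanged, incremental_matches)
```

Without the causal mask (bidirectional attention, say) the earlier rows would change when a token is appended, so this shortcut would not apply.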