by causal 796 days ago
> you don't re-compute all attention passes for all tokens for each next token to predict.

You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.

Edit: Or maybe you just meant you don't compute attention Σ(n) times for each new token?
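
To make that second reading concrete, here is a rough sketch, not anyone's actual implementation. It assumes causal masking, a single head, and a single layer. The queries, keys, and values are random stand-in vectors, and `attend` and `causal_attention` are hypothetical helper names. It checks whether the attention outputs for the first n tokens change when token n+1 is appended, and whether the new token needs more than one extra row of attention.

```python
import math
import random

def attend(q, keys, values):
    """Softmax attention of one query over the given keys/values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(len(values[0]))]

def causal_attention(qs, ks, vs):
    """Full recompute: row i attends only to positions 0..i."""
    return [attend(qs[i], ks[:i + 1], vs[:i + 1]) for i in range(len(qs))]

random.seed(0)
d, n = 4, 5  # toy sizes, chosen arbitrarily

def rand_vec():
    return [random.gauss(0, 1) for _ in range(d)]

# Stand-in q/k/v vectors for n + 1 tokens (assumption: given directly).
qs = [rand_vec() for _ in range(n + 1)]
ks = [rand_vec() for _ in range(n + 1)]
vs = [rand_vec() for _ in range(n + 1)]

out_n = causal_attention(qs[:n], ks[:n], vs[:n])  # sequence of n tokens
out_n1 = causal_attention(qs, ks, vs)             # sequence of n + 1 tokens

# With a causal mask, rows 0..n-1 never see token n, so they are unchanged.
rows_unchanged = out_n1[:n] == out_n

# The only new work is one row: the new query against all keys/values,
# where the first n keys/values could come from a cache.
new_row = attend(qs[n], ks, vs)
new_row_matches = new_row == out_n1[n]
```

Under those assumptions both `rows_unchanged` and `new_row_matches` come out true: the maps for the earlier tokens do not differ between n and n+1, and the new token costs one attention row over n+1 keys rather than a full recompute. Without the causal mask (bidirectional attention) the earlier rows would change, which is the case I had in mind in the original question.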