

by causal 796 days ago

> you don't re-compute all attention passes for all tokens for each next token to predict.

You don't? I imagine the attention maps could be pretty different between n and n+1 tokens.

Edit: Or maybe you just meant you don't compute attention Σ(n) times for each new token?
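
One way to probe this question is a toy check. The sketch below assumes a single attention head with causal masking (each position attends only to itself and earlier positions), which is the usual setup in decoder-only models but is not stated in the thread. The function names and the random query/key/value vectors are hypothetical stand-ins for a real model's projections. Under that assumption, the outputs for the first n positions should come out the same whether the sequence has n or n+1 tokens, so a key/value cache would only need one new attention row per generated token.

```python
import math
import random


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q, keys, values):
    # Scaled dot-product attention for one query over the given keys/values.
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def full_causal_attention(qs, ks, vs):
    # Recompute every position from scratch; position i sees only 0..i.
    return [attend(qs[i], ks[: i + 1], vs[: i + 1]) for i in range(len(qs))]


def incremental_attention(qs, ks, vs):
    # Process one token at a time, appending to a key/value cache.
    k_cache, v_cache, outputs = [], [], []
    for q, k, v in zip(qs, ks, vs):
        k_cache.append(k)
        v_cache.append(v)
        outputs.append(attend(q, k_cache, v_cache))
    return outputs


def random_vectors(count, dim):
    # Hypothetical stand-in for per-token query/key/value projections.
    return [[random.gauss(0, 1) for _ in range(dim)] for _ in range(count)]


random.seed(0)
n, d = 6, 4
qs = random_vectors(n + 1, d)
ks = random_vectors(n + 1, d)
vs = random_vectors(n + 1, d)

out_n = full_causal_attention(qs[:n], ks[:n], vs[:n])  # sequence of n tokens
out_n1 = full_causal_attention(qs, ks, vs)             # sequence of n+1 tokens
cached = incremental_attention(qs, ks, vs)             # one new row per token
```

Comparing `out_n` against `out_n1[:n]`, and `cached` against `out_n1`, is the check of interest. Without the causal mask (bidirectional attention), the earlier rows would generally change when a token is appended, which matches the intuition that the attention maps could differ between n and n+1 tokens.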