by xcv123 807 days ago
As I understand it, the attention mechanism normally spends a fixed amount of processor time on each token.
The innovation here is to prioritize tokens, so that some tokens get more processor time and others get less.
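
To make that concrete, here's a toy sketch of how I picture it. This is my own illustration, not code from the work being discussed: the per-token scores, the `capacity` cutoff, and the function names `fixed_budget` and `prioritized_budget` are all made up, and I'm assuming a simple "top-k tokens get the full compute, the rest get skipped" rule, which may not match the actual method.

```python
def fixed_budget(num_tokens, cost_per_token=1.0):
    """Baseline: every token gets the same amount of compute."""
    return [cost_per_token] * num_tokens


def prioritized_budget(scores, capacity, full_cost=1.0, skip_cost=0.0):
    """Hypothetical prioritization: only the `capacity` highest-scoring
    tokens get the full compute; the rest get `skip_cost`."""
    # Rank token positions by score, highest first (ties keep original order).
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:capacity])
    return [full_cost if i in chosen else skip_cost for i in range(len(scores))]


tokens = ["The", "cat", "sat", "on", "the", "mat"]
# Made-up importance scores, one per token (an assumption for illustration).
scores = [0.1, 0.9, 0.7, 0.2, 0.1, 0.8]

fixed = fixed_budget(len(tokens))
prioritized = prioritized_budget(scores, capacity=3)

for tok, f, p in zip(tokens, fixed, prioritized):
    print(f"{tok:>4}  fixed={f}  prioritized={p}")
print("total fixed:", sum(fixed), " total prioritized:", sum(prioritized))
```

In this toy version the total compute drops from 6 units to 3, and all of it goes to the tokens scored as most important.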