by xcv123 807 days ago
The way I understand it, the attention mechanism normally spends the same fixed amount of processor time on every token.

The innovation here is to prioritize tokens, so that some tokens receive more processor time and others receive less.
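
To make that concrete, here is a rough toy sketch of how I picture it, not the actual method. Everything in it is my own assumption for illustration: the names `route_tokens` and `apply_layer`, the per-token priority scores, and the idea that low-priority tokens skip the expensive step and pass through unchanged.

```python
def route_tokens(scores, capacity):
    """Pick the `capacity` highest-priority token positions (hypothetical router)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:capacity])  # restore original token order


def apply_layer(tokens, scores, capacity, expensive_fn):
    """Run `expensive_fn` only on the chosen tokens; the rest pass through unchanged."""
    chosen = set(route_tokens(scores, capacity))
    out, calls = [], 0
    for i, tok in enumerate(tokens):
        if i in chosen:
            out.append(expensive_fn(tok))  # this token gets the processor time
            calls += 1
        else:
            out.append(tok)  # this token skips the expensive step
    return out, calls


# Stand-in values: four "tokens", made-up priority scores, budget of two tokens.
tokens = [1.0, 2.0, 3.0, 4.0]
scores = [0.1, 0.9, 0.3, 0.8]
out, calls = apply_layer(tokens, scores, capacity=2, expensive_fn=lambda x: x * 10)
print(out, calls)
```

With a budget of two, only the two highest-scoring tokens go through the expensive function, so total work is set by the budget rather than by the number of tokens.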