


	by xcv123 807 days ago
	As I understand it, the attention mechanism normally consumes the same fixed amount of processor time for every token. The innovation here is to prioritize tokens, so that some tokens get more processor time and others get less.
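
To make the idea concrete, here is a rough sketch of per-token prioritization under a fixed compute budget. It is only an illustration of my reading, not the actual method being discussed: the names `route_tokens`, `expensive_block`, and `process`, the per-token `scores`, and the `capacity` budget are all hypothetical, and the "expensive" step is a toy stand-in for attention plus MLP.

```python
def route_tokens(scores, capacity):
    """Return the (sorted) indices of the `capacity` highest-scoring tokens.

    `scores` is a hypothetical per-token priority; a real model would
    have to learn it somehow.
    """
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:capacity])


def expensive_block(x):
    # Toy stand-in for the costly computation (assumption, not a real layer).
    return x * 2.0 + 1.0


def process(tokens, scores, capacity):
    """Apply the expensive block only to the prioritized tokens.

    Tokens that are not selected pass through unchanged, so they cost
    (almost) nothing. Returns the outputs and the number of expensive calls.
    """
    chosen = set(route_tokens(scores, capacity))
    out = []
    calls = 0
    for i, x in enumerate(tokens):
        if i in chosen:
            out.append(expensive_block(x))
            calls += 1
        else:
            out.append(x)  # skipped: no extra processor time spent
    return out, calls


tokens = [0.5, 1.0, -2.0, 3.0]
scores = [0.1, 0.9, 0.3, 0.7]
out, calls = process(tokens, scores, capacity=2)
print(out, calls)
```

With `capacity=2`, only the two highest-priority tokens go through the expensive step, so the total cost is set by the budget rather than by the number of tokens.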