| HN Mirror


	by lonk11 710 days ago
	My understanding is that the attention in all transformer layers is "causal": that is, the output of a transformer layer for token N depends only on tokens 0 through N. This means every attention layer can reuse previously calculated outputs for the same prompt prefix, so the model only needs to calculate from scratch starting at the first token where the new prompt differs from the cached one.
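
	A minimal sketch of that property, assuming a single attention head, one layer, and random weights (the names `causal_attention`, `prefix`, `suffix_a`, and `suffix_b` are made up for illustration, not taken from any particular model or library). Two prompts that share a 5-token prefix should produce the same outputs at those 5 positions, whatever comes after:

```python
import numpy as np

def causal_attention(x, wq, wk, wv):
    """Single-head causal self-attention over a (seq_len, d) input."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Mask out future positions: row i may only attend to columns 0..i.
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    # Numerically stable softmax over each row.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
d = 8  # embedding size (arbitrary for this sketch)
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

prefix = rng.normal(size=(5, d))    # shared prompt prefix (5 token embeddings)
suffix_a = rng.normal(size=(3, d))  # continuation of prompt A
suffix_b = rng.normal(size=(4, d))  # a different continuation for prompt B

out_a = causal_attention(np.vstack([prefix, suffix_a]), wq, wk, wv)
out_b = causal_attention(np.vstack([prefix, suffix_b]), wq, wk, wv)

# Outputs at the shared prefix positions do not depend on what follows.
prefix_outputs_match = np.allclose(out_a[:5], out_b[:5])
print(prefix_outputs_match)
```

	The same argument carries through stacked layers, since each layer's input at position N is itself a function of tokens 0 through N only. In practice, implementations usually cache each layer's keys and values for the prefix rather than the layer outputs, because those are what later tokens need to attend to.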