| HN Mirror



	by gfysfm 387 days ago
	Hi, I wrote the post. Thank you! That is how it works, but unfortunately denoising the last paragraph requires computing attention scores for every token in that paragraph, which means checking each of those tokens against every token in the sequence, and doing so again on each denoising pass. So it’s still much less cacheable than the equivalent autoregressive model.
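
A rough way to see the gap is to count query-key dot products for one attention head in one layer. The sketch below is only illustrative: the function names, the assumption that the earlier text's keys and values are cached in both cases, and the example sizes and step count are all hypothetical and not taken from the post.

```python
def autoregressive_scores(prefix_len: int, block_len: int) -> int:
    """Query-key dot products to generate block_len tokens with a KV cache."""
    total = 0
    for i in range(block_len):
        # One new query attends to the cached prefix, the tokens generated
        # so far in this block, and itself.
        total += prefix_len + i + 1
    return total


def block_denoising_scores(prefix_len: int, block_len: int, steps: int) -> int:
    """Query-key dot products to denoise one block over `steps` passes."""
    # Assumption: every token in the block can change on each pass, so all
    # block_len queries are recomputed against the cached prefix keys plus
    # the block's own (freshly recomputed) keys.
    per_step = block_len * (prefix_len + block_len)
    return steps * per_step


if __name__ == "__main__":
    # Hypothetical sizes: 1000 tokens of earlier text, a 100-token final
    # paragraph, and 20 denoising passes.
    prefix_len, block_len, steps = 1000, 100, 20
    ar = autoregressive_scores(prefix_len, block_len)
    diff = block_denoising_scores(prefix_len, block_len, steps)
    print(f"autoregressive:  {ar} dot products")
    print(f"block denoising: {diff} dot products")
    print(f"ratio: {diff / ar:.1f}x")
```

Under these assumptions the autoregressive model pays for each new token once, while the denoiser pays for the whole paragraph on every pass, so the ratio grows roughly with the number of passes.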