| HN Mirror

	by toxik 1199 days ago
	You get this issue even without position embeddings. Attention computes an inner product between every pair of input tokens, so for N tokens of embedding size E the cost is on the order of N^2 x E. Squares grow really fast.
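
To put rough numbers on that, here is a minimal sketch in plain Python. It assumes N is the number of tokens and E is the embedding size, and it only counts the multiply-adds for the pairwise inner products (no softmax, no multiple heads, no value projection). The function names `attention_score_cost` and `attention_scores` are made up for illustration and do not come from any library.

```python
def attention_score_cost(n_tokens: int, embed_dim: int) -> int:
    """Multiply-adds needed to form every pairwise inner product: N * N * E."""
    return n_tokens * n_tokens * embed_dim


def attention_scores(tokens: list[list[float]]) -> list[list[float]]:
    """Naive N x N table of inner products between all pairs of token vectors."""
    return [
        [sum(a * b for a, b in zip(query, key)) for key in tokens]
        for query in tokens
    ]


if __name__ == "__main__":
    # Doubling the sequence length quadruples the work at a fixed E.
    for n in (1024, 2048, 4096):
        print(n, attention_score_cost(n, 64))

    # Three toy 2-dimensional "tokens" give a 3 x 3 score table.
    print(attention_scores([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]))
```

The score table alone has N x N entries, so memory grows quadratically as well unless the implementation avoids materializing it.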