| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by inciampati 1226 days ago
	Practical attention implementations don't work over arbitrary length sequences. The universal approximation theorem holds IMO. Information will mix as you go through fully connected MLP layers. Attention is apparently a prior structure that is needed to really reduce training costs.