Hacker News new | ask | show | jobs
by inciampati 1226 days ago
Practical attention implementations don't work over arbitrary length sequences. The universal approximation theorem holds IMO. Information will mix as you go through fully connected MLP layers. Attention is apparently a prior structure that is needed to really reduce training costs.