| HN Mirror



	by alquemist 2034 days ago
	FWIW, transformers are to sequences what convnets are to grids, modulo important considerations like kernel size and normalization. Think of transformers as really wide (N) and really short (1) convolutions. Both are instances of graphnets with a suitable neighbor function. Once normalization was cracked by transformers, all sorts of interesting graphnets became possible, though it's possible that stacked k-dimensional convolutions are sufficient in practice.
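One way to read the "graphnets with a suitable neighbor function" analogy is as a rough sketch in plain Python. Everything here is a simplifying assumption made for illustration: features are scalars, there are no learned query/key/value projections, and the names `aggregate`, `conv_neighbors`, `conv_weights`, `all_neighbors`, and `attention_weights` are hypothetical, not taken from any library. The point is only that a 1-D convolution and self-attention can both be written as "mix the values of your neighbors", differing in who counts as a neighbor and how the mixing weights are chosen.

```python
import math

def aggregate(xs, neighbors, weight):
    """Generic graph-style layer: each position i mixes the values of its neighbors."""
    out = []
    for i in range(len(xs)):
        js = neighbors(i, len(xs))   # which positions i may look at
        ws = weight(i, js, xs)       # one mixing weight per neighbor
        out.append(sum(w * xs[j] for w, j in zip(ws, js)))
    return out

# Convolution: a small local neighborhood, with fixed weights indexed by offset.
def conv_neighbors(k):
    r = k // 2
    return lambda i, n: [j for j in range(i - r, i + r + 1) if 0 <= j < n]

def conv_weights(kernel):
    r = len(kernel) // 2
    # Weights depend only on the relative offset j - i, never on the content.
    return lambda i, js, xs: [kernel[j - i + r] for j in js]

# Self-attention (toy version): every position is a neighbor, and the weights
# are a softmax over content-dependent scores, so they sum to 1.
def all_neighbors(i, n):
    return list(range(n))

def attention_weights(i, js, xs):
    scores = [xs[i] * xs[j] for j in js]      # scalar stand-in for a query-key dot product
    m = max(scores)                           # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

xs = [1.0, 2.0, 3.0, 4.0]
conv_out = aggregate(xs, conv_neighbors(3), conv_weights([0.25, 0.5, 0.25]))
attn_out = aggregate(xs, all_neighbors, attention_weights)
```

In this sketch the convolution truncates its kernel at the sequence edges (no padding), and the softmax is the only "normalization" shown. Real transformer layers add learned projections, multiple heads, and layer normalization, none of which are modeled here.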

1 comment

whimsicalism 2034 days ago

I work in the field; I don't need the difference explained to me.

> Think of transformers as really wide (N) and really short (1) convolutions

Modern transformer networks are not "really short", and you're also conflating intra- and inter-attention.

There is still a pitched battle being waged between convnets and transformers for sequences. Although it looks like transformers have the upper hand accuracy-wise right now, convnets are competitive speed-wise.
