by alquemist
2034 days ago
FWIW, transformers are to sequences what convnets are to grids, modulo important considerations like kernel size and normalization. Think of transformers as really wide (N) and really short (1) convolutions. Both are instances of graphnets with a suitable neighbor function. Once normalization was cracked by transformers, all sorts of interesting graphnets became possible, though stacked k-dimensional convolutions may well be sufficient in practice.
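To make the "graphnets with a suitable neighbor function" framing concrete, here is a rough sketch in plain Python. Everything in it is an illustrative assumption rather than how any real library does it: the helper names (`aggregate`, `conv_layer`, `attention_layer`) are made up, features are single scalars, and the attention has no learned query/key/value projections. The only point is that both layers are the same weighted sum over neighbors, differing in which positions count as neighbors and where the weights come from.

```python
import math

def aggregate(xs, neighbors, weight):
    """Generic graph-net style layer: each output is a weighted sum over neighbors."""
    out = []
    for i in range(len(xs)):
        js = neighbors(i, len(xs))   # which positions position i looks at
        ws = weight(i, js, xs)       # one weight per neighbor
        out.append(sum(w * xs[j] for w, j in zip(ws, js)))
    return out

def conv_layer(xs, kernel):
    """1-D convolution: small local neighborhood, fixed weights indexed by offset."""
    r = len(kernel) // 2  # assumes an odd-length kernel

    def neighbors(i, n):
        # Positions within r of i, dropping anything past the sequence edges.
        return [j for j in range(i - r, i + r + 1) if 0 <= j < n]

    def weight(i, js, xs):
        # Weights depend only on the relative offset j - i, never on the data.
        return [kernel[j - i + r] for j in js]

    return aggregate(xs, neighbors, weight)

def attention_layer(xs):
    """Toy self-attention: every position is a neighbor, weights are data-dependent."""

    def neighbors(i, n):
        # The "really wide (N)" part: the neighborhood is the whole sequence.
        return list(range(n))

    def weight(i, js, xs):
        # Softmax over raw dot products (scalar features, no learned projections).
        scores = [xs[i] * xs[j] for j in js]
        m = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        return [e / z for e in exps]

    return aggregate(xs, neighbors, weight)
```

The softmax in `weight` is the normalization step: it keeps the weights summing to 1 no matter how many neighbors there are, which a fixed kernel never has to worry about because its neighborhood size is constant.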
|
> Think of transformers as really wide (N) and really short (1) convolutions
Modern transformer networks are not "really short", and you're also conflating intra- and inter-attention.
There is still a pitched battle being waged between convnets and transformers for sequences. Transformers look like they have the upper hand accuracy-wise right now, but convnets are competitive speed-wise.