| HN Mirror

- hashing followed by dot product in transformer you said

- you were doing dot products at each layer to introduce non-linearity in transformer (and neural nets in general). Polynomials are already non-linear, so you don't need that. Transformer and vw -interact are polynomials. Maybe the feedforward layers and skip connections are not actually needed.

- 12 layers ? vw -interact xxxxxxxxxxxxx is 12 layers. You need a lot of memory for that, but in principle vw interact can do any number of them

These results are coming from google and their massive compute resources. If they ran vw with -interact x^13 they might get similar results.

We're really talking about polynomial approximation here, both transformer and vw used in this way. And that is in theory able to approximate any continuous function (just like neural networks).