by visarga
2346 days ago
> This vector is equivalent to a 2 layer reformer. There is no feed forward layer, no skip connections and no layer normalization in VW. In the reformer, hashing is followed by dot products. In VW hashing just collides some tokens, followed by a linear layer. Also, 2 layers of transformer is a little shallow. In practice it's 12-14 layers or more. In order to be equivalent, there would need to be equally good results on translation from VW, but I've never seen it used for translation. I'm wondering why?

- You do dot products at each layer of a transformer (and of neural nets in general) to introduce non-linearity. Polynomials are already non-linear, so you don't need that. The transformer and vw --interactions are both polynomials. Maybe the feedforward layers and skip connections are not actually needed.
- 12 layers? vw --interactions xxxxxxxxxxxxx is 12 layers. You need a lot of memory for that, but in principle vw interactions can go to any order.
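To make the "interactions are polynomials" point concrete, here is a minimal sketch of what an order-k interaction amounts to: every degree-k monomial of the input features gets hashed into a weight table, and the model stays linear in those monomials. This is not VW's actual implementation. The function names, the CRC32 hash, the table size, and the choice to include repeated features (such as a*a) are all assumptions made for illustration.

```python
from itertools import combinations_with_replacement
from math import comb, prod
import zlib


def interaction_features(features, order, num_bits=18):
    """Expand {name: value} into hashed degree-`order` monomials.

    Hypothetical helper: each monomial (a product of `order` feature
    values) is hashed to a slot; colliding monomials are summed.
    """
    table_size = 1 << num_bits
    hashed = {}
    for combo in combinations_with_replacement(sorted(features), order):
        slot = zlib.crc32("*".join(combo).encode()) % table_size
        value = prod(features[name] for name in combo)
        hashed[slot] = hashed.get(slot, 0.0) + value
    return hashed


def predict(weights, hashed):
    """Linear model over the hashed monomials, i.e. a polynomial in the inputs."""
    return sum(weights.get(slot, 0.0) * value for slot, value in hashed.items())


def monomial_count(num_features, order):
    """Number of distinct degree-`order` monomials over `num_features` features."""
    return comb(num_features + order - 1, order)


feats = {"a": 2.0, "b": 3.0}
quad = interaction_features(feats, order=2)  # monomials a*a, a*b, b*b
```

The memory remark follows from `monomial_count`: the number of monomials grows combinatorially with the order, so a high-order interaction over even a handful of features fills the hash table quickly and collisions start doing real damage.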
These results come from Google and its massive compute resources. If they ran vw with a degree-13 interaction (x^13), they might get similar results.
We're really talking about polynomial approximation here, for both the transformer and vw used in this way. In theory a polynomial can approximate any continuous function on a closed, bounded interval to arbitrary accuracy (just like neural networks).
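The approximation claim is the Weierstrass theorem, and one constructive way to see it is with Bernstein polynomials on [0, 1]. The sketch below is only an illustration of the theorem, not of how a transformer or vw arrives at its polynomial. The target function and the grid size are arbitrary choices.

```python
from math import comb


def bernstein(f, degree):
    """Return the degree-`degree` Bernstein polynomial of f on [0, 1]."""
    samples = [f(k / degree) for k in range(degree + 1)]

    def poly(x):
        return sum(
            samples[k] * comb(degree, k) * x**k * (1 - x) ** (degree - k)
            for k in range(degree + 1)
        )

    return poly


def max_error(f, degree, grid_points=201):
    """Largest absolute gap between f and its approximation on a uniform grid."""
    poly = bernstein(f, degree)
    grid = [i / (grid_points - 1) for i in range(grid_points)]
    return max(abs(f(x) - poly(x)) for x in grid)


def kink(x):
    """Continuous but not smooth: a worst-ish case for polynomials."""
    return abs(x - 0.5)


errors = {degree: max_error(kink, degree) for degree in (5, 50, 200)}
```

The error shrinks as the degree grows, but slowly for a function with a kink, which is the practical catch: the theorem guarantees a good polynomial exists at some degree, not that the degree is affordable.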