by rkwasny 1252 days ago
Turns out it does not matter if you have a transformer, MLP, LSTM, or whatever: as long as there are enough parameters and training epochs over a large dataset, things "just work".
2 comments

This isn't true - the model architecture matters a lot.

In general, RNNs struggle with long-term dependencies (i.e., long pieces of text) because the gradient vanishes. It's unclear how this approach solves that problem, although they do reference the "attention free transformer" paper: https://arxiv.org/abs/2105.14103
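For anyone who wants to see the vanishing gradient concretely, here is a minimal sketch. It assumes a toy scalar RNN, h_t = tanh(w * h_{t-1} + x_t), with a made-up weight of 0.9 and constant inputs. The function name and all values are hypothetical and only illustrate the effect; this is not the model discussed in the thread.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return |d h_T / d h_0| for the toy scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x in inputs:
        h = math.tanh(w * h + x)
        # Chain rule: d h_t / d h_{t-1} = w * (1 - h_t^2), and |1 - h_t^2| <= 1.
        grad *= w * (1.0 - h * h)
    return abs(grad)

# Hypothetical settings: recurrent weight 0.9, constant input 0.1.
short_grad = gradient_through_time(0.9, [0.1] * 5)
long_grad = gradient_through_time(0.9, [0.1] * 200)

print(f"gradient over   5 steps: {short_grad:.3e}")
print(f"gradient over 200 steps: {long_grad:.3e}")
```

Each step multiplies the gradient by a factor of magnitude at most |w|, so with |w| < 1 it shrinks at least geometrically with sequence length. That is why the early tokens of a long text contribute almost nothing to the update.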

The key components are linear attention [1] and residual connections.

[1] https://arxiv.org/abs/2006.16236

> Transformers achieve remarkable performance in several tasks but due to their quadratic complexity, with respect to the input's length, they are prohibitively slow for very long sequences. To address this limitation, we express the self-attention as a linear dot-product of kernel feature maps and make use of the associativity property of matrix products to reduce the complexity from O(N^2) to O(N), where N is the sequence length. We show that this formulation permits an iterative implementation that dramatically accelerates autoregressive transformers and reveals their relationship to recurrent neural networks. Our linear transformers achieve similar performance to vanilla transformers and they are up to 4000x faster on autoregressive prediction of very long sequences.
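To make the "relationship to recurrent neural networks" in that abstract concrete, here is a rough pure-Python sketch of causal linear attention computed two ways: the quadratic form that compares every query with every earlier key, and the recurrent form that carries a fixed-size running state. It assumes the elu(x) + 1 feature map from the linear attention paper [1]. The function names, shapes, and random inputs are made up for illustration, and this is not necessarily the exact mechanism of the model being discussed.

```python
import math
import random

def phi(x):
    # Feature map elu(x) + 1, applied elementwise; always positive.
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_quadratic(Q, K, V):
    """Causal linear attention, O(N^2): each query visits all earlier keys."""
    m = len(V[0])
    out = []
    for i, q in enumerate(Q):
        fq = phi(q)
        weights = [dot(fq, phi(K[j])) for j in range(i + 1)]
        norm = sum(weights)
        out.append([sum(w * V[j][c] for j, w in enumerate(weights)) / norm
                    for c in range(m)])
    return out

def attention_recurrent(Q, K, V):
    """Same result, O(N): keep running sums S (d x m) and z (d) as RNN state."""
    d, m = len(K[0]), len(V[0])
    S = [[0.0] * m for _ in range(d)]  # running sum of phi(k) v^T
    z = [0.0] * d                      # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fq, fk = phi(q), phi(k)
        for a in range(d):
            z[a] += fk[a]
            for c in range(m):
                S[a][c] += fk[a] * v[c]
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][c] for a in range(d)) / norm
                    for c in range(m)])
    return out

# Hypothetical toy sizes: sequence length 6, key dim 4, value dim 3.
random.seed(0)
N, d, m = 6, 4, 3
Q = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
K = [[random.gauss(0, 1) for _ in range(d)] for _ in range(N)]
V = [[random.gauss(0, 1) for _ in range(m)] for _ in range(N)]

out_quadratic = attention_quadratic(Q, K, V)
out_recurrent = attention_recurrent(Q, K, V)
max_diff = max(abs(a - b)
               for row_a, row_b in zip(out_quadratic, out_recurrent)
               for a, b in zip(row_a, row_b))
print("outputs match:", max_diff < 1e-9)
```

The recurrent version only ever stores S and z, whose size does not depend on how many tokens have been seen. That is what lets the same layer be run step by step like an RNN at inference time.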

I believe it's because you train it in GPT-mode and then only use RNN-mode for inference.
To some degree, because we keep recreating the truly essential components the crude "Turing-complete" way. In time, as we analyze the resulting models, we may find which patterns emerge and optimize for them. The result will be smaller, faster models that perform like larger, slower ones.