by Fripplebubby 715 days ago
To further problematize this question (which I don't feel like I can actually answer), consider this paper: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" - https://arxiv.org/pdf/2006.16236

What this shows is that a specific, narrow kind of transformer (one with "causal masking" and the paper's linearized attention in place of softmax attention - see paper) can be rewritten exactly as an RNN.
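To make the equivalence concrete, here is a minimal NumPy sketch of my own (not code from the paper), assuming a single head and the elu(x) + 1 feature map. The function names are made up for illustration. The first function computes causally masked linear attention the "transformer" way, over all positions at once; the second computes it the "RNN" way, carrying a fixed-size state from token to token.

```python
import numpy as np

def phi(x):
    # Feature map elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise,
    # so all features stay positive.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_parallel(Q, K, V):
    # "Transformer" view: build the full (T, T) score matrix, then mask it.
    A = phi(Q) @ phi(K).T
    A = np.tril(A)  # causal mask: position i only sees positions j <= i
    return (A @ V) / A.sum(axis=1, keepdims=True)

def causal_linear_attention_recurrent(Q, K, V):
    # "RNN" view: a fixed-size state (S, z) updated once per token.
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))  # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d_k)         # running sum of phi(k_j), used to normalize
    out = np.zeros((T, d_v))
    for t in range(T):
        k, q = phi(K[t]), phi(Q[t])
        S += np.outer(k, V[t])
        z += k
        out[t] = (q @ S) / (q @ z)
    return out
```

Feeding both functions the same random Q, K, V should give outputs that agree up to floating-point error. The state (S, z) does not grow with sequence length, which is what makes the second form an RNN.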

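Similarly, Mamba (https://arxiv.org/abs/2312.00752), the other hot architecture at the moment, is built around a unit that behaves like a gated RNN. For performance reasons, I believe the earlier state space models it builds on (e.g. S4) are computed as an equivalent CNN during training and as an RNN during inference! Mamba's input-dependent version gives up the convolutional form and trains with a parallel scan instead.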
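A minimal sketch of that CNN/RNN duality, assuming a time-invariant linear state space model with scalar inputs and outputs (the S4-style setting; as far as I know, Mamba's selective, input-dependent parameters do not admit this convolutional form). The matrices A, B, C and the function names are illustrative and not taken from either paper's code.

```python
import numpy as np

def ssm_recurrent(A, B, C, x):
    # RNN view: h_t = A h_{t-1} + B x_t, y_t = C h_t, one step per input.
    h = np.zeros(A.shape[0])
    ys = []
    for x_t in x:
        h = A @ h + B * x_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolution(A, B, C, x):
    # CNN view: precompute the kernel k_j = C A^j B, then convolve causally.
    T = len(x)
    kernel = np.zeros(T)
    A_pow_B = B.copy()  # holds A^j B, starting at j = 0
    for j in range(T):
        kernel[j] = C @ A_pow_B
        A_pow_B = A @ A_pow_B
    # Full convolution truncated to the first T outputs is the causal part.
    return np.convolve(x, kernel)[:T]
```

Both functions should return the same sequence for the same inputs. The convolutional form can be evaluated for all positions at once, while the recurrent form needs only the current state h at inference time.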

2 comments

There are still important distinctions. An RNN keeps a fixed-size state, while a standard (softmax-attention) transformer's memory grows with each new token, since it has to keep keys and values for the whole context. They are related, but one could in theory process an unbounded sequence while the other cannot because of growing memory usage.
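A toy illustration of the memory difference, assuming a single attention head with a key/value cache and a plain tanh RNN. Both classes are hypothetical and only count per-sequence state, not model weights.

```python
import numpy as np

class AttentionDecoder:
    """Single-head softmax attention decoding with a key/value cache."""
    def __init__(self, d):
        self.d = d
        self.keys, self.values = [], []

    def step(self, q, k, v):
        # Every new token adds one key and one value to the cache.
        self.keys.append(k)
        self.values.append(v)
        K, V = np.stack(self.keys), np.stack(self.values)
        scores = K @ q / np.sqrt(self.d)
        w = np.exp(scores - scores.max())  # numerically stable softmax
        return (w / w.sum()) @ V

    def memory_floats(self):
        return 2 * len(self.keys) * self.d

class RecurrentDecoder:
    """Plain tanh RNN: the hidden state is overwritten at every step."""
    def __init__(self, d, rng):
        self.W = rng.standard_normal((d, d)) / np.sqrt(d)
        self.U = rng.standard_normal((d, d)) / np.sqrt(d)
        self.h = np.zeros(d)

    def step(self, x):
        self.h = np.tanh(self.W @ self.h + self.U @ x)
        return self.h

    def memory_floats(self):
        return self.h.size

def compare_memory(num_tokens, d=8, seed=0):
    rng = np.random.default_rng(seed)
    attn, rnn = AttentionDecoder(d), RecurrentDecoder(d, rng)
    for _ in range(num_tokens):
        x = rng.standard_normal(d)
        attn.step(x, x, x)  # reuse x as query, key and value for simplicity
        rnn.step(x)
    return attn.memory_floats(), rnn.memory_floats()
```

With d = 8, the attention cache holds 2 * 8 floats per token seen so far, while the RNN state stays at 8 floats no matter how long the sequence gets.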
To be more concrete: you might decide not to learn about RNNs, but still find them lurking in the things you did learn about!