To further problematize this question (which I don't feel like I can actually answer), consider this paper: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" - https://arxiv.org/pdf/2006.16236
What this shows is that a specific, narrow kind of transformer (one with "causal masking" and the linearized attention the paper proposes - see paper) can be rewritten exactly as an RNN.
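To make that concrete, here is a rough sketch of the idea as I understand it, not the paper's code. It computes causal linear attention twice: once the "transformer" way (each position looks back over all earlier positions) and once as an RNN with a fixed-size state. The feature map elu(x) + 1 is the one the paper uses; the function names, sizes, and random inputs are my own assumptions, and it is plain Python so it runs without any libraries.

```python
import math
import random

def phi(x):
    # Feature map elu(x) + 1: x + 1 for x > 0, exp(x) otherwise (always positive).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attention_masked(Q, K, V):
    """Causal linear attention, 'transformer' view: position i scans all j <= i."""
    out = []
    for i, q in enumerate(Q):
        fq = phi(q)
        weights = [dot(fq, phi(K[j])) for j in range(i + 1)]
        norm = sum(weights)
        out.append([sum(w * V[j][d] for j, w in enumerate(weights)) / norm
                    for d in range(len(V[0]))])
    return out

def attention_recurrent(Q, K, V):
    """Same outputs, 'RNN' view: a fixed-size state (S, z) updated once per token."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # running sum of phi(k) * v^T
    z = [0.0] * dk                       # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][d] for a in range(dk)) / norm
                    for d in range(dv)])
    return out

# Hypothetical toy sizes and random inputs, just to compare the two views.
random.seed(0)
T, dk, dv = 6, 4, 3
Q = [[random.gauss(0, 1) for _ in range(dk)] for _ in range(T)]
K = [[random.gauss(0, 1) for _ in range(dk)] for _ in range(T)]
V = [[random.gauss(0, 1) for _ in range(dv)] for _ in range(T)]

masked = attention_masked(Q, K, V)
recurrent = attention_recurrent(Q, K, V)
max_diff = max(abs(a - b) for ra, rb in zip(masked, recurrent) for a, b in zip(ra, rb))
print("largest difference between the two views:", max_diff)
```

The two functions agree up to floating-point rounding. Note that this only works because the softmax is replaced by a kernel feature map; ordinary softmax attention does not collapse into a fixed-size state like this.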
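Similarly Mamba (https://arxiv.org/abs/2312.00752), the other hot architecture at the moment, has a core unit that is equivalent to a gated RNN. For performance reasons, the earlier state-space models it builds on (such as S4) use an equivalent convolution during training and an RNN during inference! As far as I understand, Mamba itself makes its parameters input-dependent, which rules out the convolution, so it trains with a parallel scan instead and still runs as an RNN at inference.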
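Here is a minimal sketch of the convolution/recurrence equivalence, assuming a time-invariant linear state-space layer with a scalar state (the S4-style case). As far as I understand, Mamba's selective parameters depend on the input, so this exact convolution trick does not carry over to it directly. The names `a`, `b`, `c` and the toy input are my own placeholders.

```python
def ssm_recurrent(x, a, b, c):
    """RNN view: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t (one step per token)."""
    h = 0.0
    ys = []
    for x_t in x:
        h = a * h + b * x_t
        ys.append(c * h)
    return ys

def ssm_convolution(x, a, b, c):
    """CNN view: y = kernel * x (causal convolution) with kernel[k] = c * a**k * b."""
    kernel = [c * (a ** k) * b for k in range(len(x))]
    return [sum(kernel[k] * x[t - k] for k in range(t + 1)) for t in range(len(x))]

# Hypothetical toy input and parameters.
x = [1.0, -2.0, 0.5, 3.0, 0.0, 1.5]
a, b, c = 0.9, 0.5, 2.0

y_rec = ssm_recurrent(x, a, b, c)
y_conv = ssm_convolution(x, a, b, c)
print(y_rec)
print(y_conv)
```

Unrolling the recurrence gives h_t as a sum of a**k * b * x_{t-k}, which is exactly the convolution. The convolution can be computed for all positions at once during training, while the recurrence needs only the current state at inference.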
There are still important distinctions. RNNs use constant memory, while transformers grow their memory (the key/value cache) with each new token. They are related, but one could in theory process an unbounded sequence while the other cannot, because its memory usage keeps growing.
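A back-of-the-envelope way to see the memory difference, as a sketch only: the layer counts and dimensions below are made-up, roughly small-model-sized numbers, and the formulas assume a plain stacked RNN and a standard decoder-only transformer that caches one key and one value vector per token, head, and layer.

```python
def rnn_state_floats(num_layers, hidden_dim, num_tokens):
    """Floats an RNN carries between steps: one hidden vector per layer."""
    return num_layers * hidden_dim  # num_tokens is deliberately unused

def kv_cache_floats(num_layers, num_heads, head_dim, num_tokens):
    """Floats in a key/value cache: a key and a value per token, head, and layer."""
    return 2 * num_layers * num_heads * head_dim * num_tokens

# Hypothetical model sizes, chosen only for illustration.
for n in (1_000, 10_000, 100_000):
    print(n, rnn_state_floats(12, 768, n), kv_cache_floats(12, 12, 64, n))
```

The RNN column stays flat while the cache column grows linearly with the number of tokens seen so far.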
Transformers have finite context, RNNs don't. In practice the RNN's gradient signal is limited by backpropagation through time: it decays as it travels back across time steps. This is in fact the whole selling point of transformers; within the context window, an association is no harder or easier to learn at a long distance than at a short one. But in theory an RNN can remember infinitely far back.
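A tiny illustration of the decay, assuming the simplest possible case: a scalar RNN h_t = tanh(w * h_{t-1} + x_t) with a hypothetical weight w = 0.9 and constant inputs. The gradient of the last state with respect to the first is a product of one factor per step, and here every factor is below 1 in magnitude.

```python
import math

def gradient_through_time(w, inputs, h0=0.0):
    """Return d h_T / d h_0 for the scalar RNN h_t = tanh(w * h_{t-1} + x_t)."""
    h = h0
    grad = 1.0
    for x_t in inputs:
        h = math.tanh(w * h + x_t)
        grad *= w * (1.0 - h * h)  # chain rule factor d h_t / d h_{t-1}
    return grad

# Hypothetical weight and inputs; only the trend with distance matters.
w = 0.9
grad_5 = gradient_through_time(w, [0.1] * 5)
grad_50 = gradient_through_time(w, [0.1] * 50)
print("gradient across 5 steps: ", grad_5)
print("gradient across 50 steps:", grad_50)
```

With |w| below 1 the product shrinks geometrically with distance; with large weights it can blow up instead. Either way, learning a dependency gets harder the further apart the two ends are, even though nothing stops the forward state from carrying the information.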
Not if you want to be a PhD/Researcher in ML, yes otherwise.
Source: Working on ML/LLMs as a research engineer for the past 7 years, including for one of the FAANG research labs. I always wanted to take time to learn about RNNs but never did and never needed to.
Thanks. The reason I asked the question is that I've struggled to understand RNNs and other networks (compared to MLPs, CNNs, and transformers) due to the subtlety of their design and my hope was that I could simply forget about them.
I'm surprised about only testing for LSTMs. Of all the sequence/memory models, they seem like the most arbitrary and hacky, but I've never been able to determine if that's simply because I don't understand those types of models (my training is in HMMs; do you teach/test those?)
No, we don't teach HMMs (although that would be super cool). It's strictly a neural networks class.
A lot of my research has focused on LSTMs, and so I am partial to them. I think they are super useful and have a lot of nice properties, but frankly speaking, if you had to skip one of the architectures you mentioned, LSTMs/RNNs are probably the most OK to skip.
That said, if you just look at a simple RNN like a Jordan RNN and focus on understanding that, then LSTMs just become fancy RNNs with some forgetting and remembering logic.
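To show what I mean, here is a scalar sketch of both, under my own simplifications: every weight is a single number, and the parameter names and values are made up for illustration. The Jordan-style step feeds the previous output back in; the LSTM step is the same idea plus gates that decide what to forget, what to write, and what to expose.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def jordan_step(x, y_prev, p):
    """Jordan-style RNN: the previous *output* feeds back into the hidden unit."""
    h = math.tanh(p["w_x"] * x + p["w_y"] * y_prev + p["b_h"])
    return math.tanh(p["w_o"] * h + p["b_o"])

def lstm_step(x, h_prev, c_prev, p):
    """LSTM: a recurrent step with gated forgetting and remembering."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep some old memory, write some new
    h = o * math.tanh(c)     # expose part of the memory
    return h, c

# Hypothetical parameters. The LSTM biases are set so it mostly "remembers":
# forget gate near 1 (keep the cell) and input gate near 0 (write little).
jordan_params = {"w_x": 0.5, "w_y": 0.5, "b_h": 0.0, "w_o": 1.0, "b_o": 0.0}
lstm_params = {"wf": 0.5, "uf": 0.5, "bf": 10.0,
               "wi": 0.5, "ui": 0.5, "bi": -10.0,
               "wo": 0.5, "uo": 0.5, "bo": 0.0,
               "wg": 0.5, "ug": 0.5, "bg": 0.0}

xs = [1.0, -1.0, 0.5, -0.5] * 5  # 20 toy inputs in [-1, 1]

y = 0.0
for x in xs:
    y = jordan_step(x, y, jordan_params)

h, c = 0.0, 0.7  # start the cell with a value worth remembering
for x in xs:
    h, c = lstm_step(x, h, c, lstm_params)

print("Jordan output after 20 steps:", y)
print("LSTM cell after 20 steps (started at 0.7):", c)
```

With these gate biases the cell state barely moves over 20 steps, which is the "remembering" part; flip the biases and it forgets and overwrites instead.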