by dekhn 722 days ago
Are RNNs completely subsumed by transformers? I.e., can I forget about learning how to work with RNNs, and instead focus on transformers?

To further problematize this question (which I don't feel like I can actually answer), consider this paper: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" - https://arxiv.org/pdf/2006.16236

What this shows is that a specific, narrow kind of transformer (one with linear attention and "causal masking" - see paper) is equivalent to an RNN with a fixed-size state.
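A minimal NumPy sketch of that equivalence, for a single attention head. It assumes the elu(x) + 1 feature map from the paper; the function names and shapes are illustrative and not taken from the authors' code. The first function computes causally masked linear attention over the whole sequence at once, the second computes the same outputs one token at a time while carrying only a fixed-size state.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: equals x + 1 for x > 0 and exp(x) otherwise, so it is always positive
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention_parallel(Q, K, V):
    """Causally masked linear attention, computed for all positions at once."""
    phi_q, phi_k = feature_map(Q), feature_map(K)
    scores = np.tril(phi_q @ phi_k.T)      # (T, T); position i only sees j <= i
    return (scores @ V) / scores.sum(axis=1, keepdims=True)

def causal_linear_attention_recurrent(Q, K, V):
    """The same outputs, computed as an RNN with state (S, z)."""
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))               # running sum of outer(phi(k_j), v_j)
    z = np.zeros(d_k)                      # running sum of phi(k_j)
    out = np.zeros((T, d_v))
    for t in range(T):
        q, k = feature_map(Q[t]), feature_map(K[t])
        S += np.outer(k, V[t])             # state update, independent of t
        z += k
        out[t] = (q @ S) / (q @ z)
    return out

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 3))
print(np.allclose(causal_linear_attention_parallel(Q, K, V),
                  causal_linear_attention_recurrent(Q, K, V)))
```

The state (S, z) has the same size no matter how many tokens have been seen, which is what makes the recurrent form an RNN. Standard softmax attention does not factor this way.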

Similarly Mamba (https://arxiv.org/abs/2312.00752), the other hot architecture at the moment, has an equivalent unit to a gated RNN. For performance reasons, I believe they use an equivalent CNN during training and an RNN during inference!
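To illustrate the RNN/CNN duality in the simplest setting, here is a toy sketch of a linear state-space model with fixed parameters, run once as a recurrence and once as a causal convolution. This is not Mamba's implementation: the names, shapes, and scalar input are assumptions, and the equivalence shown holds only when A, B, and C do not depend on the input. Mamba's selective layers make the parameters input-dependent, and the paper describes computing them with a scan instead.

```python
import numpy as np

def ssm_recurrent(A, B, C, u):
    """RNN view: h_t = A h_{t-1} + B u_t, y_t = C h_t, one step at a time."""
    h = np.zeros(A.shape[0])
    ys = []
    for u_t in u:
        h = A @ h + B * u_t
        ys.append(C @ h)
    return np.array(ys)

def ssm_convolution(A, B, C, u):
    """CNN view: y = u * k (causal), with kernel k_j = C A^j B."""
    T = len(u)
    kernel = np.empty(T)
    A_pow_B = B.copy()                     # holds A^j B
    for j in range(T):
        kernel[j] = C @ A_pow_B
        A_pow_B = A @ A_pow_B
    return np.convolve(u, kernel)[:T]      # keep only the first T (causal) outputs

rng = np.random.default_rng(0)
A = 0.3 * rng.normal(size=(4, 4))
B, C = rng.normal(size=4), rng.normal(size=4)
u = rng.normal(size=20)
print(np.allclose(ssm_recurrent(A, B, C, u), ssm_convolution(A, B, C, u)))
```

The convolution form can process a whole training sequence in parallel, while the recurrent form needs only the current state at inference time.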

There still are important distinctions. RNNs have constant memory while transformers expand their memory with each new token. They are related, but one could in theory process an unbounded sequence while the other cannot because of growing memory usage.
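A back-of-the-envelope sketch of that difference. The layer count and widths are made-up example values, and the formulas are simplifications (they ignore variants such as multi-query attention, and the extra cell state an LSTM carries).

```python
def kv_cache_floats(num_tokens, num_layers, d_model):
    """Floats held by a transformer KV cache: one key and one value vector
    of size d_model per token per layer."""
    return 2 * num_tokens * num_layers * d_model

def rnn_state_floats(num_layers, hidden_dim):
    """Floats held by a stacked RNN: one hidden vector per layer, regardless
    of how many tokens have been processed."""
    return num_layers * hidden_dim

for n in (1_000, 10_000, 100_000):
    print(n,
          kv_cache_floats(n, num_layers=12, d_model=768),
          rnn_state_floats(num_layers=12, hidden_dim=768))
```

The first count grows linearly with the number of tokens; the second does not change.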
To be more concrete: you might decide not to learn about RNNs, but still find them lurking in the things you did learn about!
Transformers have finite context; RNNs don’t. In practice, the RNN gradient signal is limited by backpropagation through time: it decays with distance. This is in fact the whole selling point of transformers: association is no harder at long distance than at short distance. But in theory an RNN can remember infinitely far back.
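A small sketch of the decay, assuming a plain tanh RNN with no inputs and a recurrent matrix rescaled to spectral norm 0.9 (both are illustrative choices, not something from the thread). It tracks how strongly the state at step t still depends on the initial state.

```python
import numpy as np

def bptt_gradient_norms(W, h0, steps):
    """Frobenius norm of d h_t / d h_0 for h_t = tanh(W h_{t-1}), t = 1..steps."""
    h = h0
    J = np.eye(len(h0))                    # accumulated Jacobian d h_t / d h_0
    norms = []
    for _ in range(steps):
        h = np.tanh(W @ h)
        step_jacobian = (1.0 - h**2)[:, None] * W   # diag(1 - h_t^2) @ W
        J = step_jacobian @ J              # chain rule through one more time step
        norms.append(np.linalg.norm(J))
    return norms

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8))
W *= 0.9 / np.linalg.norm(W, 2)            # rescale so the spectral norm is 0.9
norms = bptt_gradient_norms(W, rng.normal(size=8), steps=50)
print(norms[0], norms[-1])
```

With the spectral norm below 1, every extra time step shrinks the Jacobian, so the gradient reaching distant steps falls off geometrically. With a larger norm the same product can blow up instead.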
Not if you want to be a PhD/Researcher in ML, yes otherwise.

Source: I've worked on ML/LLMs as a research engineer for the past 7 years, including at one of the FAANG research labs. I always wanted to take the time to learn about RNNs but never did and never needed to.

Oh, I'm sure plenty of recent PhDs don't know about RNNs. They've been dropped like a hot potato in the last 4-5 years.
I think to do pure research it’s definitely worth knowing about the big ideas of the past, why we moved on from them, what we learned etc.
I haven’t read it in a while, but I remember this post giving a good rundown of RNNs:

https://dennybritz.com/posts/wildml/recurrent-neural-network...

None of the students who have taken the classes I TA pass w/o learning about RNNs.
Is that true also of LSTMs?
Yes. We cover Jordan and Elman RNNs, LSTMs, and GRUs. Assignments only really test for LSTM knowledge, though.
Thanks. The reason I asked the question is that I've struggled to understand RNNs and other networks (compared to MLPs, CNNs, and transformers) due to the subtlety of their design and my hope was that I could simply forget about them.

I'm surprised about only testing for LSTMs; of all the sequence/memory models, they seem like the most arbitrary and hacky, but I've never been able to determine if that's simply because I don't understand those types of models (my training is in HMMs; do you teach/test those?)

No, we don't teach HMMs (although that would be super cool). It's strictly a neural networks class.

A lot of my research has focused on LSTMs, and so I am partial to them. I think they are super useful and have a lot of properties, but frankly speaking, if you had to choose one architecture from the ones you mentioned, LSTMs/RNNs are probably the most OK to skip.

That said, if you just look at a simple RNN like the Jordan RNNs and focus on understanding that, then LSTMs just become fancy RNNs with some forgetting and remembering logic.
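A rough NumPy sketch of that comparison. It uses the Elman form of the simple RNN (the state feeds back into itself; a Jordan network feeds back the previous output instead) next to a standard LSTM step. The weight layout and names are illustrative assumptions, not any particular library's API.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def elman_step(x, h_prev, W, U, b):
    """Simple (Elman) RNN: the new state mixes the input with the previous state."""
    return np.tanh(W @ x + U @ h_prev + b)

def lstm_step(x, h_prev, c_prev, W, U, b):
    """LSTM: the same kind of update, plus gates that decide what to forget,
    what to write, and what to expose. W, U, b stack four blocks in the
    order [forget, input, candidate, output] along the first axis."""
    n = h_prev.shape[0]
    z = W @ x + U @ h_prev + b             # shape (4n,)
    f = sigmoid(z[0:n])                    # forget gate: how much old memory to keep
    i = sigmoid(z[n:2 * n])                # input gate: how much new content to write
    g = np.tanh(z[2 * n:3 * n])            # candidate content (this is the "simple RNN" part)
    o = sigmoid(z[3 * n:4 * n])            # output gate: how much memory to expose
    c = f * c_prev + i * g                 # remembering and forgetting
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
d_in, n = 3, 5
x, h, c = rng.normal(size=d_in), np.zeros(n), np.zeros(n)
h_simple = elman_step(x, h, rng.normal(size=(n, d_in)), rng.normal(size=(n, n)), np.zeros(n))
h_lstm, c_lstm = lstm_step(x, h, c, rng.normal(size=(4 * n, d_in)),
                           rng.normal(size=(4 * n, n)), np.zeros(4 * n))
print(h_simple.shape, h_lstm.shape, c_lstm.shape)
```

The candidate line `g` is essentially the simple RNN update; everything else in `lstm_step` is gating around a separate memory `c`.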