| HN Mirror



	by dekhn 722 days ago
	Are RNNs completely subsumed by transformers? I.e., can I forget about learning how to work with RNNs and focus on transformers instead?

3 comments

Fripplebubby 721 days ago

To further problematize this question (which I don't feel like I can actually answer), consider this paper: "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention" - https://arxiv.org/pdf/2006.16236

What this shows is that a specific, narrow kind of transformer (one with "causal masking" and linear attention - see paper) is equivalent to an RNN.
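A rough way to see the equivalence is to compute causal linear attention twice: once by attending over all earlier positions, and once by carrying a fixed-size running state. This is only a toy sketch in plain Python, not the paper's code. The function names and the list-of-lists layout are made up for illustration; the feature map elu(x) + 1 is the one the paper uses.

```python
import math

def phi(vec):
    # Positive feature map elu(x) + 1: x + 1 for x > 0, exp(x) otherwise.
    return [x + 1.0 if x > 0 else math.exp(x) for x in vec]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_linear_attention_parallel(Q, K, V):
    # "Transformer view": position i attends over every position j <= i.
    out = []
    for i in range(len(Q)):
        fq = phi(Q[i])
        weights = [dot(fq, phi(K[j])) for j in range(i + 1)]
        norm = sum(weights)
        out.append([sum(w * V[j][d] for j, w in enumerate(weights)) / norm
                    for d in range(len(V[0]))])
    return out

def causal_linear_attention_recurrent(Q, K, V):
    # "RNN view": a fixed-size state (S, z) is updated once per token.
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # running sum of phi(k) * v^T
    z = [0.0] * dk                       # running sum of phi(k)
    out = []
    for q, k, v in zip(Q, K, V):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
        fq = phi(q)
        norm = dot(fq, z)
        out.append([sum(fq[a] * S[a][d] for a in range(dk)) / norm
                    for d in range(dv)])
    return out
```

Both functions should return the same outputs up to floating-point error; the recurrent one never looks back at earlier tokens, only at its state.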

Similarly, Mamba (https://arxiv.org/abs/2312.00752), the other hot architecture at the moment, has a unit equivalent to a gated RNN. For performance reasons, I believe they use an equivalent CNN during training and an RNN during inference!
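The CNN/RNN duality is easiest to see for a linear state-space layer with fixed parameters. The sketch below uses a hypothetical scalar state with made-up parameter names a, b, c. It is not Mamba's actual layer: Mamba makes these parameters input-dependent, which this toy ignores, so take it only as an illustration of why a linear recurrence can also be evaluated as a convolution.

```python
def ssm_recurrent(x, a, b, c):
    # RNN view: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t.
    h, ys = 0.0, []
    for xt in x:
        h = a * h + b * xt
        ys.append(c * h)
    return ys

def ssm_convolution(x, a, b, c):
    # CNN view: unrolling the recurrence gives y_t = sum_j (c * a**j * b) * x_{t-j}.
    kernel = [c * (a ** j) * b for j in range(len(x))]
    return [sum(kernel[j] * x[t - j] for j in range(t + 1))
            for t in range(len(x))]
```

The convolution form can process the whole sequence at once, while the recurrent form needs only the previous state to emit the next output.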


visarga 721 days ago

There are still important distinctions. RNNs have a constant-size memory, while transformers expand their memory with each new token. They are related, but one could in theory process an unbounded sequence, while the other cannot because of its growing memory usage.
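As back-of-the-envelope arithmetic, assuming a vanilla RNN that keeps one hidden vector per layer and a transformer that caches one key and one value vector per token per layer (the function names and the sizes 512 and 4 are arbitrary, picked only for illustration):

```python
def rnn_state_floats(hidden_dim, num_layers):
    # The hidden state is overwritten at every step, so its size
    # does not depend on how many tokens have been seen.
    return hidden_dim * num_layers

def kv_cache_floats(num_tokens, model_dim, num_layers):
    # One key vector and one value vector per token, per layer.
    return 2 * num_tokens * model_dim * num_layers

for n in (1_000, 100_000):
    print(n, rnn_state_floats(512, 4), kv_cache_floats(n, 512, 4))
```

The first number stays fixed as the sequence grows; the second grows linearly with the number of tokens.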


Fripplebubby 721 days ago

To be more concrete: you might decide not to learn about RNNs, but still find them lurking in the things you did learn about!


toxik 721 days ago

Transformers have finite context; RNNs don’t. In practice, the RNN gradient signal is limited by backpropagation through time: it decays. This is in fact the whole selling point of transformers: association is no harder or easier at long distance than at short distance. But in theory an RNN can remember infinitely far back.
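The decay can be seen in a one-unit toy RNN, h_t = tanh(w * h_{t-1} + x_t). This is a hypothetical scalar example, not any particular library's RNN: by the chain rule, the gradient of the last state with respect to the initial state is a product of one factor per step, and with |w| < 1 every factor has magnitude below 1.

```python
import math

def gradient_wrt_initial_state(w, xs, h0=0.0):
    # Accumulate d h_T / d h_0 for h_t = tanh(w * h_{t-1} + x_t).
    h, grad = h0, 1.0
    for x in xs:
        h = math.tanh(w * h + x)
        grad *= w * (1.0 - h * h)  # chain rule: d h_t / d h_{t-1}
    return grad

for steps in (5, 50, 500):
    print(steps, gradient_wrt_initial_state(0.9, [0.1] * steps))
```

The printed gradient shrinks as the number of steps grows, which is the sense in which the signal from far-away tokens fades during training.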


sailingparrot 722 days ago

Not if you want to be a PhD/Researcher in ML, yes otherwise.

Source: I've been working on ML/LLMs as a research engineer for the past 7 years, including at one of the FAANG research labs. I always wanted to take time to learn about RNNs but never did and never needed to.


rolisz 721 days ago

Oh, I'm sure plenty of recent PhDs don't know about RNNs. They've been dropped like a hot potato in the last 4-5 years.


sailingparrot 721 days ago

I think to do pure research it’s definitely worth knowing about the big ideas of the past, why we moved on from them, what we learned, etc.


derangedHorse 721 days ago

I haven’t read it in a while, but I remember this post giving a good rundown of RNNs:

https://dennybritz.com/posts/wildml/recurrent-neural-network...


jszymborski 721 days ago

None of the students who have taken the classes I TA pass without learning about RNNs.


dekhn 721 days ago

Is that true also of LSTMs?


jszymborski 720 days ago

Yes. We cover Jordan and Elman RNNs, LSTMs, and GRUs. Assignments only really test for LSTM knowledge, though.


dekhn 719 days ago

Thanks. The reason I asked the question is that I've struggled to understand RNNs and other networks (compared to MLPs, CNNs, and transformers) due to the subtlety of their design and my hope was that I could simply forget about them.

I'm surprised about only testing for LSTMs. Of all the sequence/memory models, they seem like the most arbitrary and hacky, but I've never been able to determine whether that's simply because I don't understand those types of models. (My training is in HMMs; do you teach/test those?)


jszymborski 716 days ago

No, we don't teach HMMs (although that would be super cool). It's strictly a neural networks class.

A lot of my research has focused on LSTMs, so I am partial to them. I think they are super useful and have a lot of useful properties, but frankly speaking, if you had to choose one of the architectures you mentioned to skip, LSTMs/RNNs are probably the most OK.

That said, if you just look at a simple RNN like a Jordan RNN and focus on understanding that, then LSTMs just become fancy RNNs with some forgetting and remembering logic.
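To make that concrete, here is a scalar sketch of one step of each. It is a simplified illustration, not course material or a library implementation; the parameter dictionaries and their key names are invented for this example. The Jordan step feeds the previous output back in, and the LSTM step adds gates that decide what to forget, what to write, and what to expose.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def jordan_step(x, y_prev, p):
    # Jordan RNN: the previous *output* is fed back into the hidden layer.
    h = math.tanh(p["w_x"] * x + p["w_y"] * y_prev + p["b_h"])
    return sigmoid(p["v"] * h + p["b_y"])

def lstm_step(x, h_prev, c_prev, p):
    # Each entry of p is a hypothetical (input weight, recurrent weight, bias) triple.
    def pre(name):
        w, u, b = p[name]
        return w * x + u * h_prev + b
    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the new candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # new candidate value
    c = f * c_prev + i * g           # forgetting and remembering
    h = o * math.tanh(c)
    return h, c
```

With the forget gate pushed towards 1 and the input gate towards 0, the cell state passes through a step almost unchanged, which is the "remembering" part.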
