by gwern 2204 days ago
LSTMs typically forget after more than a few hundred tokens (vanishing gradients?), so while you could probably BPTT 2000+ steps these days, there wouldn't be much point.
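
To make the "vanishing gradients?" guess concrete, here's a rough sketch (my own toy setup, not anything from the thread). It uses a vanilla tanh RNN rather than an LSTM, since LSTM gating softens the effect but the mechanism is easiest to see without it. The recurrent matrix is assumed to be an orthogonal matrix scaled by 0.9, so every step can shrink the backpropagated signal by at least that factor; `gradient_norms`, `hidden`, and `scale` are illustrative names.

```python
import numpy as np

def gradient_norms(steps, hidden=32, scale=0.9, seed=0):
    """Spectral norm of d h_t / d h_0 for a toy tanh RNN, for t = 1..steps."""
    rng = np.random.default_rng(seed)
    # Assumption: orthogonal matrix scaled so all singular values equal `scale`.
    q, _ = np.linalg.qr(rng.standard_normal((hidden, hidden)))
    w = scale * q
    h = rng.standard_normal(hidden)
    jac = np.eye(hidden)  # accumulates the product of per-step Jacobians
    norms = []
    for _ in range(steps):
        h = np.tanh(w @ h)
        # One-step Jacobian is diag(1 - h_t^2) @ W; chain it onto the product.
        step_jac = (1.0 - h ** 2)[:, None] * w
        jac = step_jac @ jac
        norms.append(np.linalg.norm(jac, 2))
    return norms

if __name__ == "__main__":
    norms = gradient_norms(300)
    for t in (1, 10, 100, 300):
        print(t, norms[t - 1])
```

Since the tanh derivative is at most 1, the norm after t steps is bounded by 0.9^t, so a few hundred steps back the signal is numerically negligible in this toy. Real LSTMs decay more slowly, but this is the general shape of the problem.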

> I'm sure somebody somewhere is working to make a transformer with recurrency. The neural turing machine mentioned in another comment is such an example but it seems to have been abandoned.

Yeah, there's a bunch of Transformer variants which use recurrence, compression for long-range context, or efficient attention approximations for windows so large as to obviate recurrence. It's not so much that the NTM has been shown useless as that alternatives like Transformers have proven way easier to implement & scale up to get similar performance. It still pops up occasionally, though; a particularly surprising recent appearance was Nvidia's GameGAN, which uses an NTM-like memory module for learning to model Pac-Man: https://nv-tlabs.github.io/gameGAN/

1 comment

I recently read a paper that enables very long unrolls in RNNs thanks to O(1) memory requirements (in the number of unroll steps): https://arxiv.org/pdf/2005.11362.pdf