by gwern
2204 days ago
LSTMs typically forget after more than a few hundred tokens (vanishing gradients?), so while you could probably BPTT 2000+ steps these days, there wouldn't be much point.

> I'm sure somebody somewhere is working to make a transformer with recurrency. The neural turing machine mentioned in another comment is such an example but it seems to have been abandoned.

Yeah, there's a bunch of Transformer variants which either use recurrency, compression for long-range context, or efficient attention approximation for windows so large as to obviate recurrency. The NTM hasn't been shown useless so much as alternatives like Transformers have proven to be way easier to implement & scale up to get similar performance, but it pops up occasionally; a particularly surprising recent appearance was Nvidia's GameGAN, which uses an NTM-like memory module for learning to model Pac-Man: https://nv-tlabs.github.io/gameGAN/