by tshadley 384 days ago
The goal here is not to replace transformers but to combine them with an RNN, so you get both good short-term memory (self-attention) and much-improved long-term memory (ATLAS recurrent memory).
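To make the "attention for short-term, recurrence for long-term" split concrete, here is a rough toy sketch. It is not the ATLAS memory update from the paper: it swaps in a plain decayed outer-product memory (linear-attention style) as a stand-in, and every name in it (`hybrid_forward`, `window`, `decay`) is made up for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def window_attention(query, keys, values):
    """Short-term path: softmax attention over a small window of recent tokens."""
    weights = softmax([dot(query, k) for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def hybrid_forward(tokens, window=2, decay=0.9):
    """Toy hybrid layer. Each token vector serves as its own query, key and value.

    ASSUMPTION: the memory rule below (decayed outer product) is a stand-in,
    not the update rule used by ATLAS.
    """
    dim = len(tokens[0])
    memory = [[0.0] * dim for _ in range(dim)]  # fixed-size recurrent state
    outputs = []
    for t, x in enumerate(tokens):
        # Long-term read: query the memory built from all *earlier* tokens.
        long_term = [dot(row, x) for row in memory]
        # Short-term read: attend only to the last `window` tokens (including x).
        recent = tokens[max(0, t - window + 1): t + 1]
        short_term = window_attention(x, recent, recent)
        outputs.append([s + l for s, l in zip(short_term, long_term)])
        # Write: decay the old state, then add the outer product x x^T.
        memory = [[decay * memory[i][j] + x[i] * x[j] for j in range(dim)]
                  for i in range(dim)]
    return outputs

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
out = hybrid_forward(tokens, window=1, decay=0.9)
```

With `window=1` the third token cannot attend to the first one at all, yet its output still picks up a contribution from it through the recurrent state. That is the whole point of the hybrid: the attention window stays cheap and the memory carries whatever fell out of it.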

"Empirically, our models—OmegaNet, Atlas, DeepTransformers, and Dot—achieve consistent improvements over Transformers and recent RNN variants across diverse benchmarks."