Hacker News new | ask | show | jobs
by smaddox 757 days ago
Transformers have quadratic computational complexity in sequence length, i.e. O(N^2) where N is the sequence length. RNNs, Linformer, Mamba, etc. have linear or quasi-linear computational complexity in sequence length, which often bottlenecks information movement across tokens.

In theory, if you grew the RNN's state quadratically vs sequence length, you could likely achieve comparable performance to transformers, but it would likely be less efficient than transformers.