Hacker News
by akkishore 757 days ago
Hi Andrej,

Huge fan of all the work you do. I wanted to understand something fundamental, and who better to ask than you: what's so special about the transformer architecture that it's able to predict the next token so beautifully, understanding all the intricate relationships among previous tokens? I understand attention, but what's so special about this architecture that no other architecture is able to "attend" appropriately to previous tokens? Being a CS guy, it's really hard for me to fathom that we have not yet created another architecture that can perform similarly.

1 comment

Transformers have quadratic computational complexity in sequence length: O(N^2), where N is the sequence length, because every token attends directly to every previous token. RNNs, Linformer, Mamba, etc. have linear or quasi-linear complexity in sequence length, which often bottlenecks information movement across tokens.
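To make the counting concrete, here is a rough pure-Python sketch, not anyone's production code. `causal_attention` and `recurrent_scan` are hypothetical names, the attention uses the raw inputs as queries, keys, and values (no learned projections), and the recurrence is a toy exponential moving average standing in for an RNN-style fixed-size state.

```python
import math


def causal_attention(xs):
    """Single-head causal self-attention over a list of equal-length vectors.

    Assumption: inputs double as queries, keys, and values.
    Returns (outputs, number of query-key scores computed).
    """
    d = len(xs[0])
    outputs, score_count = [], 0
    for i, q in enumerate(xs):
        context = xs[: i + 1]  # token i sees itself and every earlier token
        scores = []
        for k in context:
            scores.append(sum(a * b for a, b in zip(q, k)) / math.sqrt(d))
            score_count += 1
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, context)) / z for j in range(d)]
        )
    return outputs, score_count


def recurrent_scan(xs, decay=0.5):
    """Toy linear recurrence: one fixed-size state, updated once per token.

    Returns (outputs, number of state updates).
    """
    state = [0.0] * len(xs[0])
    outputs, update_count = [], 0
    for x in xs:
        state = [decay * s + (1 - decay) * xi for s, xi in zip(state, x)]
        update_count += 1
        outputs.append(list(state))
    return outputs, update_count


if __name__ == "__main__":
    tokens = [[float(i), 1.0] for i in range(8)]
    _, n_scores = causal_attention(tokens)
    _, n_updates = recurrent_scan(tokens)
    print(n_scores, n_updates)  # N(N+1)/2 scores vs N updates
```

In the attention loop each token gets a direct path to every earlier token, which is where the N(N+1)/2 scores come from. In the scan, everything a later token learns about the past has to pass through that one fixed-size `state`.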

In theory, if you grew the RNN's state quadratically with sequence length, you could probably achieve performance comparable to transformers, but it would likely be less efficient than a transformer.
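A back-of-envelope sketch of that efficiency point, under loudly stated assumptions: I'm reading "quadratically" as a state of N^2 entries, assuming the cheapest possible update (one operation per state entry per token), and counting only attention-score dot products on the transformer side. The function names and the head dimension of 64 are made up for illustration.

```python
def attention_ops(n, d):
    # Causal attention: n*(n+1)/2 query-key pairs, each a d-dimensional dot product.
    return n * (n + 1) // 2 * d


def recurrent_ops(n, state_size):
    # Assumed elementwise update: one operation per state entry per token.
    return n * state_size


if __name__ == "__main__":
    d = 64  # hypothetical head dimension
    for n in (128, 1024, 8192):
        print(
            n,
            attention_ops(n, d),       # grows like N^2
            recurrent_ops(n, d),       # fixed-size state: grows like N
            recurrent_ops(n, n * n),   # state of N^2 entries: grows like N^3
        )
```

Under these assumptions the quadratic-state recurrence does on the order of N^3 work, more than attention's N^2 * d once N exceeds roughly d/2. A dense state-to-state transition would cost even more per step.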