by akkishore
757 days ago
Hi Andrej, huge fan of all the work you do. I wanted to understand something fundamental, and who better to ask than you: what's so special about the transformer architecture that it's able to predict the next token so beautifully, understanding all the intricate relationships between previous tokens? I understand attention, but what's so special about this architecture that no other architecture is able to "attend" appropriately to previous tokens? Being a CS guy, it's really hard for me to fathom that we have not yet created another architecture that can perform similarly.
In theory, if you grew the RNN's state quadratically with sequence length, you could probably achieve performance comparable to transformers, but it would likely be less efficient than a transformer.
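To make the state-size point concrete, here is a toy sketch (not a real model, and not anything from the thread): a plain RNN squeezes the whole history into a fixed-size state, while attention keeps every previous token around, so what it can look back at grows with the sequence. The names `rnn_step`, `attend`, and `memory_sizes` are made up for illustration, and there are no learned weights; keys and values are just the raw token vectors.

```python
import math

def rnn_step(state, x, decay=0.5):
    # Fixed-size state: each new input is blended into the same d numbers.
    return [math.tanh(decay * s + xi) for s, xi in zip(state, x)]

def attend(query, keys, values):
    # Softmax-weighted average over every stored token (dot-product scores).
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(dim)]

def memory_sizes(tokens):
    """Return (rnn_state_size, attention_cache_size) after each token."""
    d = len(tokens[0])
    state = [0.0] * d
    cache = []  # assumption: raw tokens stand in for both keys and values
    sizes = []
    for x in tokens:
        state = rnn_step(state, x)
        cache.append(x)
        attend(x, cache, cache)  # the newest token can look at all earlier ones
        sizes.append((len(state), len(cache) * d))
    return sizes

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
print(memory_sizes(tokens))
```

The first number in each pair stays at 2 no matter how long the sequence gets, while the second keeps growing. That is the gap an RNN would have to close by growing its state, as the comment suggests.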