|
|
|
|
|
by koayon
809 days ago
|
|
This is a very fair point! If we had infinite compute then it's undeniable that transformers (i.e. full attention) would be better (exactly as you characterise it) But that's the efficiency-effectiveness tradeoff that we have to make: given that compute is limited, would we prefer attention over shorter sequences or SSMs over longer sequences? The answer is probably "well, it depends on your use case" - I can definitely see reasons for both! A fairly compelling thought for me is hybrid architectures (Jamba is a recent one). Here you can imagine having perfect recall over recent tokens and lossy recall over distant tokens. E.g. if the AI is generating a feature-length film, you "could imagine having Attention look at the most recent frames for short-term fluidity and an SSM for long-term narrative consistency" (quote from the OP) |
|