by visarga
809 days ago
Yeah, all attempts at reducing complexity from quadratic to linear have failed. Only Mamba still has a chance, but it hasn't been tested on large models and only provides a speedup for sequences of 2000+ tokens. That was to be expected: transformers have very small memory requirements on short sequences, while recurrent architectures use the same hidden size no matter how short the input is. So when the recurrent hidden size > sequence length, the old transformer is faster.
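To make the crossover argument concrete, here is a rough back-of-the-envelope sketch. It is a toy cost model, not a benchmark: it assumes per-token attention work grows with the number of cached positions, while per-token recurrent work depends only on a fixed state size. The function names are hypothetical, and `hidden_size=2000` is a made-up value chosen only to mirror the 2000-token figure above, not an actual Mamba parameter.

```python
def attention_cost_per_token(seq_len: int, d_model: int) -> int:
    # Toy model: a new token attends over all seq_len cached positions.
    return seq_len * d_model


def recurrent_cost_per_token(hidden_size: int, d_model: int) -> int:
    # Toy model: a new token updates a fixed-size state,
    # regardless of how long the sequence is.
    return hidden_size * d_model


def cheaper_architecture(seq_len: int, hidden_size: int, d_model: int = 1) -> str:
    # Compare the two toy costs and report which side wins.
    attn = attention_cost_per_token(seq_len, d_model)
    rec = recurrent_cost_per_token(hidden_size, d_model)
    if attn < rec:
        return "transformer"
    if attn > rec:
        return "recurrent"
    return "tie"


if __name__ == "__main__":
    # hidden_size=2000 is an illustrative assumption, not a measured value.
    for n in (500, 2000, 8000):
        print(n, cheaper_architecture(n, hidden_size=2000))
```

Under these assumptions the transformer wins below the crossover and the recurrent model wins above it. Real numbers depend on constants, kernels, and hardware that this sketch ignores.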