by visarga 809 days ago
Yeah, all attempts at reducing complexity from quadratic to linear have failed. Only Mamba still has a chance, but it hasn't been tested on large models and only provides a speedup for 2000+ tokens. That was to be expected: transformers have very small memory requirements for short sequences, while recurrent architectures use the same hidden state size regardless of sequence length. So when the recurrent hidden size > sequence length, the old transformer is faster.
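To make the crossover argument concrete, here is a toy sketch that only counts cached elements per layer. It assumes a transformer caches one key and one value vector per token, and that a recurrent layer keeps a fixed state of `d_model * state_dim` elements. The function names, the state layout, and the sizes in the example are made up for illustration and are not taken from Mamba or any other model. It also ignores constant factors, kernels, and everything outside the sequence-mixing layer.

```python
def kv_cache_elements(seq_len: int, d_model: int) -> int:
    """Elements a transformer layer caches: one key and one value vector per token."""
    return 2 * seq_len * d_model


def recurrent_state_elements(d_model: int, state_dim: int) -> int:
    """Elements in a fixed recurrent state (hypothetical d_model x state_dim layout)."""
    return d_model * state_dim


def crossover_seq_len(d_model: int, state_dim: int) -> int:
    """Smallest sequence length at which the KV cache exceeds the fixed state."""
    fixed = recurrent_state_elements(d_model, state_dim)
    seq_len = 1
    # The cache grows linearly with seq_len; the recurrent state does not grow.
    while kv_cache_elements(seq_len, d_model) <= fixed:
        seq_len += 1
    return seq_len


if __name__ == "__main__":
    # Arbitrary illustrative sizes, not measurements of a real model.
    print(crossover_seq_len(d_model=512, state_dim=256))  # -> 129
```

Below the crossover length the transformer is carrying less state than the recurrent layer, which is the regime the comment describes. Where the crossover actually falls for a real model depends on details this sketch leaves out.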
1 comments

It's more subtle than that IMO. They haven't necessarily "failed" - they just don't have the "superpowers" that the metrics used to evaluate such systems are aimed at. E.g. no linear method devised so far (that I know of, at least) is able to do very high recall point retrieval in long context _and_ effective in-context learning simultaneously. You get one or the other, but not both. But as far as the metrics go, high recall retrieval in long context is easier for the researcher to demonstrate and for the observer to comprehend - a typical needle/haystack setting is trivial to put together. It is also something that (unlike in-context learning) humans are usually very bad at, so it's perceived as a "superpower" or "magic". In this case, e.g. Mamba being more human-like due to its selective forgetfulness is currently playing against it. But whether it's "better" per se will depend on the task. It's just that we do not know how to evaluate most of the tasks yet, so people keep looking for the proverbial keys under the lamp post, measuring what they can in order to make progress, and thereby keeping their efforts lavishly funded.
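As a rough illustration of how little it takes to put a needle/haystack setting together, here is a minimal sketch. The helper names (`build_needle_prompt`, `recall`), the filler text, and the substring-match scoring rule are all assumptions made for this example, not any benchmark's actual protocol. The model call itself is left out.

```python
import random


def build_needle_prompt(filler, needle, question, n_filler, depth, seed=0):
    """Bury `needle` at relative `depth` (0.0 = start, 1.0 = end) among filler sentences."""
    if not 0.0 <= depth <= 1.0:
        raise ValueError("depth must be between 0.0 and 1.0")
    rng = random.Random(seed)  # seeded so the same haystack can be rebuilt
    haystack = [rng.choice(filler) for _ in range(n_filler)]
    haystack.insert(int(depth * n_filler), needle)
    return " ".join(haystack) + "\n\n" + question


def recall(answers, expected):
    """Fraction of model answers that contain the expected string (case-insensitive)."""
    hits = sum(expected.lower() in answer.lower() for answer in answers)
    return hits / len(answers)


if __name__ == "__main__":
    filler = ["The sky was grey that morning.", "Trains ran late all week."]
    prompt = build_needle_prompt(
        filler,
        needle="The secret code is 7421.",
        question="What is the secret code?",
        n_filler=200,
        depth=0.5,
    )
    # In a real run, `answers` would come from the model under test.
    print(recall(["The code is 7421.", "I don't know."], "7421"))  # -> 0.5
```

Sweeping `n_filler` and `depth` gives the usual retrieval-vs-position grid. Nothing comparably simple exists for scoring in-context learning, which is the asymmetry the comment points at.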