It's more subtle than that IMO. They haven't necessarily "failed" - they just don't have the "superpowers" that the metrics used to evaluate such systems are aimed at. E.g. no linear method devised so far (that I know of, at least) is able to do very high recall point retrieval in long context _and_ effective in-context learning simultaneously. You get one or the other, but not both. But as far as the metrics go, high recall retrieval in long context is easier for the researcher to demonstrate and for the observer to comprehend - a typical needle/haystack setting is trivial to put together. It is also something that (unlike in-context learning) humans are usually very bad at, so it's perceived as a "superpower" or "magic". In this case Mamba's selective forgetfulness, which makes it more human-like, is currently playing against it. But whether it's "better" per se will depend on the task. It's just that we do not know how to evaluate most tasks yet, so people keep looking for the proverbial keys under the lamp post: they measure what they can in order to show progress, and thereby keep their efforts lavishly funded.
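To make the "trivial to put together" point concrete, here is a rough sketch of a needle/haystack recall check in Python. Everything in it is made up for illustration: the filler sentence, the passkey format, the function names, and the `model` argument, which is just any callable that takes a prompt string and returns an answer string. The two readers at the bottom are toy stand-ins and do not model any real architecture.

```python
import random
import re

# Hypothetical filler text; any digit-free sentence would do.
FILLER = "The grass is green and the sky is blue."


def build_haystack(needle: str, n_sentences: int, depth: float) -> str:
    """Bury `needle` among `n_sentences` copies of the filler sentence.

    `depth` is a relative position in [0, 1]: 0.0 puts the needle first,
    1.0 puts it last.
    """
    sentences = [FILLER] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def make_case(rng: random.Random, n_sentences: int, depth: float):
    """Return (prompt, expected_answer) for one retrieval trial."""
    passkey = str(rng.randint(10000, 99999))
    needle = f"The passkey is {passkey}."
    prompt = build_haystack(needle, n_sentences, depth) + "\nWhat is the passkey?"
    return prompt, passkey


def recall(model, n_trials: int = 20, n_sentences: int = 1000, seed: int = 0) -> float:
    """Fraction of trials in which the model's answer contains the passkey."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        # The needle is placed at a random depth on every trial.
        prompt, expected = make_case(rng, n_sentences, depth=rng.random())
        if expected in model(prompt):
            hits += 1
    return hits / n_trials


def oracle(prompt: str) -> str:
    """Toy reader that sees the whole prompt and extracts the passkey."""
    match = re.search(r"The passkey is (\d+)\.", prompt)
    return match.group(1) if match else ""


def tail_only(prompt: str, window: int = 2000) -> str:
    """Toy reader that only sees the last `window` characters of the prompt."""
    return oracle(prompt[-window:])


if __name__ == "__main__":
    print("oracle recall:", recall(oracle))
    print("tail-only recall:", recall(tail_only))
```

That is the whole benchmark: one planted fact, one question, one substring match. There is no comparably short recipe for scoring in-context learning, which is a large part of why the retrieval number is the one that gets reported.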