by in-silico 48 days ago
People are trying to solve it with software too, even if you don't hear about it.

The most high-profile example is the latest set of Qwen models, which replace most of the attention mechanisms with Gated DeltaNet (which uses constant memory with respect to sequence length).
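To make the constant-memory point concrete, here's a toy sketch of a gated delta-rule recurrence in plain Python. It's my simplified reading of the general idea, not Qwen's actual implementation: the function names, the fixed alpha/beta values, and the dimensions are all made up for illustration (in a real model the gates are computed per token). The thing to notice is that the whole history lives in one d_v x d_k matrix, so the state doesn't grow with the number of tokens, unlike a KV cache.

```python
import math
import random

def gated_delta_step(S, k, v, alpha, beta):
    """One update of the d_v x d_k state matrix S; returns a new matrix.

    Sketch of S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T.
    alpha (decay gate) and beta (write strength) are plain floats here.
    """
    d_v, d_k = len(S), len(S[0])
    # What the current state would return for key k: S @ k
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # Erase part of the old association for k, decay everything, write v
    return [[alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
             for j in range(d_k)] for i in range(d_v)]

def read(S, q):
    """Query the state: o = S @ q."""
    return [sum(row[j] * q[j] for j in range(len(q))) for row in S]

def run(seq_len, d_k=4, d_v=3, seed=0):
    """Feed seq_len random tokens through the recurrence, return final state."""
    rng = random.Random(seed)
    S = [[0.0] * d_k for _ in range(d_v)]
    for _ in range(seq_len):
        k = [rng.gauss(0, 1) for _ in range(d_k)]
        norm = math.sqrt(sum(x * x for x in k)) or 1.0
        k = [x / norm for x in k]  # unit-norm keys keep the update stable
        v = [rng.gauss(0, 1) for _ in range(d_v)]
        S = gated_delta_step(S, k, v, alpha=0.95, beta=0.5)
    return S
```

Calling run(10) and run(500) gives back matrices of the same 3 x 4 shape, which is the whole point: memory is fixed no matter how long the sequence gets. The tradeoff is that older information has to be compressed into, or overwritten in, that fixed-size state.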

Test-time training (TTT) architectures are also getting a lot of attention, and have shown great performance in academic settings. It's only a matter of time before we start getting open TTT models.