by in-silico
48 days ago
People are trying to solve it with software too, even if you don't hear about it. The most high-profile example is the latest set of Qwen models, which replace most of the attention layers with Gated DeltaNet (which uses constant memory with respect to sequence length). Test-time training (TTT) architectures are also getting a lot of attention and have shown strong results in academic settings. It's only a matter of time before we start getting open TTT models.
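
To make the "constant memory" point concrete, here is a rough single-head sketch of a gated delta-rule recurrence in plain Python. It is my own simplification, not the Qwen implementation: the function names (`gated_delta_step`, `run`), the scalar gates `alpha` and `beta`, and the toy inputs are all assumptions for illustration, and real models use learned gates, normalization, and chunked GPU kernels. The only thing it is meant to show is that the state `S` stays a fixed `d_v x d_k` matrix no matter how many tokens are processed, unlike a KV cache that grows with the sequence.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One recurrent step of a simplified gated delta rule.

    S     : d_v x d_k state matrix (list of lists), the only memory carried over
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : decay gate in [0, 1] (assumed scalar here)
    beta  : write strength in [0, 1] (assumed scalar here)

    Update: S_new = alpha * S (I - beta k k^T) + beta v k^T, output = S_new q
    """
    d_v, d_k = len(S), len(S[0])
    new_S = []
    for i in range(d_v):
        # What the old state currently returns for key k in output dimension i.
        pred_i = sum(S[i][j] * k[j] for j in range(d_k))
        # Erase the old association for k, decay everything, then write v for k.
        row = [
            alpha * (S[i][j] - beta * pred_i * k[j]) + beta * v[i] * k[j]
            for j in range(d_k)
        ]
        new_S.append(row)
    out = [sum(new_S[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return new_S, out


def run(seq, d_k, d_v):
    """Process a sequence of (q, k, v, alpha, beta) tuples with a fixed-size state."""
    S = [[0.0] * d_k for _ in range(d_v)]
    outs = []
    for q, k, v, alpha, beta in seq:
        S, o = gated_delta_step(S, q, k, v, alpha, beta)
        outs.append(o)
    return S, outs


# Toy usage: write a value for a key, then overwrite it with a new value.
k = [1.0, 0.0]
seq = [
    (k, k, [2.0, 3.0], 1.0, 1.0),
    (k, k, [5.0, 7.0], 1.0, 1.0),
]
S, outs = run(seq, d_k=2, d_v=2)
```

With `alpha = beta = 1` and a unit-norm key, the second write fully replaces the first association, which is the "delta" part; the state size after two tokens or ten thousand is the same.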