| HN Mirror

If sub-quadratic architectures (eg Mamba) become a thing, it will become feasible to precompute most of the work for a fixed prefix (i.e. system prompt) and the latency can be pretty minimal. Even with current transformers, if you have a fixed system prompt, you can save the KV cache and it helps a lot (though the inference time of each incremental token is still linear).