by meandmycode 834 days ago
Also, the cost of tokens is lower, and let's say it's cheap enough not to care. But surely reading 1M tokens isn't realistic latency-wise?
1 comment

If sub-quadratic architectures (e.g. Mamba) become a thing, it will become feasible to precompute most of the work for a fixed prefix (i.e. a system prompt), and the latency can be pretty minimal. Even with current transformers, if you have a fixed system prompt, you can save its KV cache and that helps a lot (though the inference time of each incremental token is still linear in the context length).
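
To make the KV-cache point concrete, here's a toy single-head sketch in plain Python. It's only an illustration under heavy assumptions, not how any real inference stack does it: `project`, `KVCache`, `build_cache`, and `decode_step` are made-up names, and the "projections" are plain scalings standing in for learned weights. The idea it shows is that with causal attention the keys and values of the prefix don't depend on later tokens, so you can compute them once and reuse them. Each new token still has to attend over every cached entry, which is where the linear per-token cost comes from.

```python
import math

def project(x, scale):
    # Hypothetical stand-in for a learned linear projection: scale each feature.
    return [scale * xi for xi in x]

def attend(q, keys, values):
    # Softmax attention of one query over all cached keys/values.
    # The loop over `keys` is the part that is linear in context length.
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(len(q)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, x):
        self.keys.append(project(x, 0.5))
        self.values.append(project(x, 2.0))

    def copy(self):
        # Shallow copy of the lists so a request can extend the cache
        # without touching the shared prefix.
        clone = KVCache()
        clone.keys, clone.values = list(self.keys), list(self.values)
        return clone

def build_cache(tokens):
    cache = KVCache()
    for x in tokens:
        cache.append(x)
    return cache

def decode_step(cache, x):
    # Add the new token's key/value, then attend over everything cached so far.
    cache.append(x)
    return attend(project(x, 1.0), cache.keys, cache.values)

# Fixed "system prompt" of 8 toy token embeddings, processed once up front.
system_prompt = [[0.1 * i, 1.0 - 0.1 * i] for i in range(8)]
prefix_cache = build_cache(system_prompt)

# Per request: reuse the saved prefix instead of recomputing it.
user_token = [0.3, 0.7]
cached_out = decode_step(prefix_cache.copy(), user_token)

# Baseline: rebuild the prefix from scratch for the same request.
fresh_out = decode_step(build_cache(system_prompt), user_token)
```

Both paths give the same output; the cached one just skips the prefix work. A real model does this per layer and per head, so the saved cache is correspondingly larger.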