Hacker News new | ask | show | jobs
by kang 63 days ago
> tokens written to cache all at once, which would eat up a significant % of your rate limits

Construction of context is not an llm pass - it shouldn't even count towards token usage. The word 'caching' itself says don't recompute me.

Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?

1 comments

> Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?

Input tokens are expensive, since the whole model has to be run for each token. They're cheaper than output tokens because the model doesn't need to run the sampler, so some pipeline parallelism is possible, but on the other hand without caching the input token cost would have to be paid anew for each output token.

Prompt caching fixes that O(N^2) cost, but the cache itself is very heavyweight. It needs one entry per input token per model layer, and each entry is an O(1000)-dimensional vector. That carries a huge memory cost (linear in context length), and when cached that means the context's memory space is no longer ephemeral.

That's why a 'cache write' can carry a cost; it is the cost of both processing the input and committing the backing store for the cache duration.