
> Since the devs on HN (& the whole world) is buying what looks like nonsense to me - what am I missing?

Input tokens are expensive, since the whole model has to be run for each token. They're cheaper than output tokens because the model doesn't have to sample after each one, so a batch of input tokens can be processed with some parallelism. On the other hand, without caching, the input token cost would have to be paid anew for each output token.
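To put rough numbers on that, here is a minimal sketch that counts how many token positions the model has to process with and without a cache. The function names and the example sizes are made up for illustration, and it counts positions only, ignoring that attention itself gets more expensive as the context grows.

```python
def token_passes_without_cache(prompt_len: int, output_len: int) -> int:
    """Positions processed if the full context is re-run for every output token."""
    # Producing output token i (0-based) means re-processing the prompt
    # plus the i tokens generated so far.
    return sum(prompt_len + i for i in range(output_len))


def token_passes_with_cache(prompt_len: int, output_len: int) -> int:
    """Positions processed if earlier positions are cached and never re-run."""
    # The prompt is processed once; after that, each generation step only
    # processes the single token produced by the previous step.
    return prompt_len + max(output_len - 1, 0)
```

For a hypothetical 1,000-token prompt and 100 output tokens, that is 104,950 positions without a cache versus 1,099 with one.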

Prompt caching fixes that O(N^2) cost, but the cache itself is very heavyweight. It needs one entry per input token per model layer, and each entry is a vector with on the order of 1,000 dimensions. That carries a huge memory cost (linear in context length), and once a prompt is cached, the memory holding its context is no longer ephemeral: it has to stay allocated for as long as the cache entry lives.
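A back-of-the-envelope estimate of that memory cost, following the description above. Every number here is a hypothetical example rather than a real model's configuration, and I'm assuming each entry stores two vectors (a key and a value) in 16-bit floats; real deployments often shrink this with tricks like sharing keys and values across attention heads.

```python
def kv_cache_bytes(
    num_tokens: int,
    num_layers: int,
    hidden_dim: int,
    bytes_per_value: int = 2,    # assumption: 16-bit floats
    vectors_per_entry: int = 2,  # assumption: one key and one value vector
) -> int:
    """Rough cache size: one entry per token per layer, linear in context length."""
    return num_tokens * num_layers * vectors_per_entry * hidden_dim * bytes_per_value


# Hypothetical model: 32 layers, 4096-dimensional vectors, 100k-token context.
size = kv_cache_bytes(num_tokens=100_000, num_layers=32, hidden_dim=4096)
size_gib = size / 2**30  # about 48.8 GiB for a single cached context
```

Doubling the context doubles the size, which is why keeping a long prompt cached ties up so much memory.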

That's why a 'cache write' can carry a cost; it is the cost of both processing the input and committing the backing store for the cache duration.