by bakugo 340 days ago
The initial context processing is also cached, which is why there's a significant discount on the input token cost.

What exactly is cached though? Each step of token inference is effectively a recursive loop that takes in all of the context plus all previously inferred tokens, right? Are they somehow caching the previously inferred state and using it more efficiently than if they just cached the context and then ran it all through inference again?
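
For reference, here is a minimal sketch of what is usually meant, assuming the cache in question is a transformer key/value (KV) cache. Nothing in this thread confirms how any particular provider implements it. The idea is that with causal attention, the key and value vectors computed for a token never depend on later tokens, so they can be stored once and reused on every later step. Only the newest token needs fresh computation. Everything below (`embed`, `W_Q`/`W_K`/`W_V`, `KVCache`, the token ids) is a hypothetical toy with a single attention head and no trained weights.

```python
import math
import random

D = 4  # toy hidden size (hypothetical)
_rng = random.Random(0)
# Fixed random matrices standing in for trained query/key/value weights.
W_Q, W_K, W_V = (
    [[_rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)] for _ in range(3)
)


def embed(token_id):
    """Hypothetical deterministic embedding for a token id."""
    return [math.sin(token_id * (i + 1)) for i in range(D)]


def matvec(w, x):
    """Multiply matrix w by vector x."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Softmax attention of one query over all stored keys/values."""
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(D) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(D)]


def last_output_no_cache(tokens):
    """Recompute K and V for every token, then attend from the last one."""
    xs = [embed(t) for t in tokens]
    keys = [matvec(W_K, x) for x in xs]
    values = [matvec(W_V, x) for x in xs]
    # Second return value: how many tokens had their K/V projected.
    return attend(matvec(W_Q, xs[-1]), keys, values), len(tokens)


class KVCache:
    """Stores per-token keys/values so each step only projects the new token."""

    def __init__(self):
        self.keys, self.values = [], []
        self.projections = 0  # number of tokens whose K/V were computed

    def step(self, token_id):
        x = embed(token_id)
        self.keys.append(matvec(W_K, x))
        self.values.append(matvec(W_V, x))
        self.projections += 1
        return attend(matvec(W_Q, x), self.keys, self.values)


prompt = [3, 1, 4, 1, 5]   # the "context"
generated = [9, 2, 6]      # tokens produced one at a time

cache = KVCache()
for t in prompt:           # prefill: done once, and storable for reuse
    cache.step(t)
prefill_projections = cache.projections

recomputed_work = 0
for i, t in enumerate(generated):
    cached_out = cache.step(t)
    full_out, work = last_output_no_cache(prompt + generated[: i + 1])
    recomputed_work += work
    # Same result either way; only the amount of work differs.
    assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(cached_out, full_out))

decode_projections = cache.projections - prefill_projections
print(prefill_projections, decode_projections, recomputed_work)
```

In this toy, the three generated tokens cost three K/V projections with the cache versus 6 + 7 + 8 = 21 without it. Under the same assumption, a prompt cache would amount to saving the prefill portion of that state between requests that share a prefix, which would fit with the discounted input token price mentioned above.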