| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by HarHarVeryFunny 105 days ago

It seems a lot of the problem isn't "token shrinkage" (reducing plan limits), but rather changes they made to prompt caching - things that used to be cached for 1 hour now only being cached for 5 min.

Coding agents rely on prompt caching to avoid burning through tokens - they go to lengths to try to keep context/prompt prefixes constant (arranging non-changing stuff like tool definitions and file content first, variable stuff like new instructions following that) so that prompt caching gets used.

This change to a new tokenizer that generates up to 35% more tokens for the same text input is wild - going to really increase token usage for large text inputs like code.

1 comments

mnicky 105 days ago

> things that used to be cached for 1 hour now only being cached for 5 min.

Doesn't this only apply to subagents, which don't have much long-time context anyway?

HarHarVeryFunny 104 days ago

AFAIK the way caching works is at API key level, which will be shared across the main/parent agent and all subagents.

Note that the model API is stateless - there is no connection being held open for the lifetime of any agent/subagent, so the model has no idea how long any client-side entity is running for. All the model sees over time is a bunch of requests (coming from mixture of parent and subagents) all using the same API key, and therefore eligible to use any of the cached prompt prefixes being maintained for that API key.

Things like subagent tool registration are going to remain the same across all invocations of the subagent, so those would come from cache as long as the cache TTL is long enough.