More tokens, less cost: why optimizing for token count is wrong
1 points
by nicola_alessi
108 days ago
I ran a controlled benchmark on AI coding agents (42 runs, FastAPI, Claude Sonnet 4.6) and found something that broke my mental model of LLM costs.
The setup: I built an MCP server that pre-indexes a codebase into a dependency graph and serves pre-ranked context to the agent in a single call, instead of letting the agent explore files on its own.
The expected result: less input context → lower cost. Straightforward.
The actual result: total tokens processed went UP 20% (23.4M vs 19.6M) while total cost went DOWN 58% ($6.89 vs $16.29).
The explanation is in how Anthropic prices tokens. There are three pricing tiers:
Output tokens: most expensive (3-5x input price)
Input tokens (cache miss): full price
Input tokens (cache hit): 90% discount
The agent with pre-indexed context processes more total tokens because the structured context payload is injected every turn. But the token MIX shifts dramatically:
Output tokens: 10,588 → 3,965 (-63%)
Cache read rate: 93.8% → 95.3%
Cache creation: 6.1% → 4.6%
Output tokens dominate the cost equation. When the agent receives 40K tokens of unfiltered context, it generates verbose orientation narration ("let me look at this file... I can see that..."). When it receives 8K tokens of graph-ranked context, it skips straight to the answer. 504 output tokens per task → 189.
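For anyone who wants to check the arithmetic, here is a quick sketch that recomputes the percentages from the figures quoted above. The one assumption is that the 42 runs split evenly into 21 tasks per arm; the post does not state that split, but it is what makes the per-task numbers line up.

```python
# Figures quoted in the post, as (baseline, pre-indexed) pairs.
total_tokens = (19.6e6, 23.4e6)
total_cost_usd = (16.29, 6.89)
output_tokens = (10_588, 3_965)

# Assumption: 42 runs split evenly, 21 tasks per arm (not stated in the post).
TASKS_PER_ARM = 21


def pct_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


token_delta = pct_change(*total_tokens)
cost_delta = pct_change(*total_cost_usd)
output_delta = pct_change(*output_tokens)
per_task = tuple(round(n / TASKS_PER_ARM) for n in output_tokens)

print(f"total tokens:    {token_delta:+.1f}%")   # +19.4%
print(f"total cost:      {cost_delta:+.1f}%")    # -57.7%
print(f"output tokens:   {output_delta:+.1f}%")  # -62.6%
print(f"output per task: {per_task[0]} -> {per_task[1]}")  # 504 -> 189
```

The rounded results match the +20%, -58%, -63% and 504 → 189 figures in the text.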
The cache effect compounds this: structured, consistent context across turns hits the cache more reliably than ad-hoc file reads that change every turn. So the additional input tokens cost almost nothing (90% discount) while the output token reduction saves the most expensive tokens.
The general principle: with tiered token pricing, optimizing for total token count is wrong. You should optimize for token mix — push volume from expensive tiers (output, cache miss) to cheap tiers (cache hit). More total tokens can cost less if you shift the distribution.
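A minimal sketch of that principle as a cost function. Everything here is illustrative: the `Prices` values are placeholders that only follow the ratios described above (output at 5x input, cache hits at a 90% discount), and the two `Usage` workloads are invented numbers, not the benchmark data. Cache-write surcharges are left out to keep it to the three tiers discussed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Prices:
    """USD per million tokens. Placeholder values, not a real price list."""
    output: float = 15.00
    cache_miss: float = 3.00
    cache_hit: float = 0.30  # 90% off the cache-miss price


@dataclass(frozen=True)
class Usage:
    """Token counts per pricing tier for one workload."""
    output: int
    cache_miss: int
    cache_hit: int

    @property
    def total(self) -> int:
        return self.output + self.cache_miss + self.cache_hit


def cost(usage: Usage, prices: Prices = Prices()) -> float:
    """Total cost in USD: sum of tokens times price over the three tiers."""
    return (
        usage.output * prices.output
        + usage.cache_miss * prices.cache_miss
        + usage.cache_hit * prices.cache_hit
    ) / 1_000_000


# Hypothetical workloads: the second has 20% MORE tokens overall,
# but most of them sit in the cheap cache-hit tier.
exploring = Usage(output=50_000, cache_miss=400_000, cache_hit=550_000)
pre_ranked = Usage(output=20_000, cache_miss=80_000, cache_hit=1_100_000)

for name, usage in [("exploring", exploring), ("pre_ranked", pre_ranked)]:
    print(f"{name}: {usage.total:,} tokens, ${cost(usage):.3f}")
```

With these made-up inputs the larger workload comes out cheaper, which is the whole point: the total is a poor proxy for cost once the tiers differ by an order of magnitude or more.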
This seems obvious in retrospect but I haven't seen it discussed much. Most context engineering work focuses on reducing input tokens. The bigger lever might be reducing output tokens by improving input signal-to-noise ratio: the model writes less when it doesn't have to think out loud about what it's reading.
The tool is vexp (https://vexp.dev): local-first context engine, Rust + tree-sitter + SQLite. Free tier available.
|
When you optimize for a structured context payload (like your dependency graph), you aren't just hitting the Anthropic pricing cache—you are literally reducing the routing entropy at the inference level. High-noise inputs force the model into 'exploratory' output paths, which isn't just expensive in dollars, but also in hardware stress.
We found that 'verbose orientation narration' (the thinking-out-loud part) correlates with higher entropy spikes in memory access. By tightening the input signal-to-noise ratio, you're essentially stabilizing the model's internal routing. Have you noticed any changes in latency variance (jitter) between the pre-indexed and ad-hoc runs? In our tests, lower entropy usually leads to much more predictable TTFT (Time To First Token).
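If someone wants to answer the jitter question from their own logs, one simple way to compare latency variance is the coefficient of variation of TTFT per arm. This is only a sketch: the sample lists are made-up millisecond values, and `jitter` is a hypothetical helper, not part of any tool mentioned here.

```python
import statistics


def jitter(ttft_ms):
    """Return (mean, sample standard deviation, coefficient of variation)."""
    mean = statistics.mean(ttft_ms)
    stdev = statistics.stdev(ttft_ms)
    return mean, stdev, stdev / mean


# Hypothetical TTFT samples in milliseconds, one list per arm.
adhoc_ttft_ms = [820, 1430, 760, 1910, 990, 1250]
preindexed_ttft_ms = [640, 700, 610, 680, 655, 690]

for name, samples in [("ad-hoc", adhoc_ttft_ms), ("pre-indexed", preindexed_ttft_ms)]:
    mean, stdev, cv = jitter(samples)
    print(f"{name}: mean={mean:.0f} ms, stdev={stdev:.0f} ms, cv={cv:.2f}")
```

The coefficient of variation normalizes the spread by the mean, so two arms with different average latency can still be compared on how predictable they are.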