

by das-bikash-dev 117 days ago

The multi-agent budget problem you're describing gets even harder when the services are heterogeneous. In a RAG pipeline, a single user query might hit: query analysis (LLM call), embedding generation (different model/pricing), reranking (yet another model), and response generation (LLM call) — each potentially in a different process.

Per-call monkey-patching sees each call in isolation. What I ended up doing was a trace-based approach: every request gets a trace ID, each service appends cost spans asynchronously, and a separate enrichment step aggregates the total. The hard part was deduplication — when service A reports an aggregate cost and service B reports the individual calls that compose it, you need to reconcile or you double-count.
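To make the double-counting concrete, here is a minimal sketch of the reconciliation step. It assumes each span carries a parent ID and an "aggregate" flag. The names `CostSpan`, `is_aggregate`, and `total_cost_micros` are hypothetical, made up for illustration, and are not taken from any particular tracing library. The rule used is one possible choice: an aggregate span is counted once and everything reported beneath it is skipped.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class CostSpan:
    # All field names here are illustrative assumptions.
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    cost_micros: int            # cost in millionths of a dollar, avoids float drift
    is_aggregate: bool = False  # True if this cost already includes its children


def total_cost_micros(spans: Iterable[CostSpan]) -> int:
    """Sum span costs for one trace without double-counting.

    A span is skipped when any ancestor is marked as an aggregate,
    because that ancestor's cost already includes it.
    """
    # Keying by span_id also collapses spans that were delivered twice
    # (the last delivery wins).
    by_id: Dict[str, CostSpan] = {s.span_id: s for s in spans}

    def covered_by_aggregate(span: CostSpan) -> bool:
        seen = set()  # guards against a malformed parent cycle
        parent_id = span.parent_id
        while parent_id is not None and parent_id in by_id and parent_id not in seen:
            seen.add(parent_id)
            parent = by_id[parent_id]
            if parent.is_aggregate:
                return True
            parent_id = parent.parent_id
        return False

    return sum(s.cost_micros for s in by_id.values() if not covered_by_aggregate(s))


spans = [
    # Service A reports one aggregate figure for analysis + embedding ...
    CostSpan("t1", "a", None, 50_000, is_aggregate=True),
    # ... while service B reports the individual calls that compose it.
    CostSpan("t1", "b", "a", 30_000),
    CostSpan("t1", "c", "a", 20_000),
    # Reranking and generation are reported on their own.
    CostSpan("t1", "d", None, 10_000),
    CostSpan("t1", "e", None, 40_000),
]
print(total_cost_micros(spans))
```

With these made-up numbers a naive sum gives 150000, while the reconciled total is 100000. The opposite rule (keep the individual calls and drop the aggregate) would work just as well here. What matters is picking one and applying it consistently in the enrichment step.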

Your use of atomic disk writes for halt state is a nice pattern. I went with fire-and-forget (never block the request path, accept eventual consistency on cost data), but that means you can't do hard enforcement mid-request like AgentBudget does.
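For anyone following along, the usual way to get an atomic write of a small state file is write-to-temp-then-rename. This is a generic sketch under that assumption, not AgentBudget's actual code. The function names and the `{"halted": ...}` shape are hypothetical.

```python
import json
import os
import tempfile


def write_halt_state(path: str, state: dict) -> None:
    """Replace the file at `path` so readers see the old or the new state, never a partial one."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # for the final rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".halt-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the bytes to disk before renaming
        os.replace(tmp_path, path)  # atomic swap of the directory entry
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_path)
        except FileNotFoundError:
            pass
        raise


def read_halt_state(path: str) -> dict:
    """Return the stored state, or a not-halted default if nothing was written yet."""
    try:
        with open(path) as f:
            return json.load(f)
    except FileNotFoundError:
        return {"halted": False}  # hypothetical default shape
```

The blocking `fsync` is exactly the cost the fire-and-forget approach avoids. It buys a halt flag that survives a crash, at the price of a disk round trip on the request path.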

1 comment

tenpa0000 117 days ago

The deduplication problem is the part I haven't worked out cleanly. The hierarchy in veronica-core sidesteps it as long as you declare parent-child relationships upfront — B's spend rolls directly into A's ceiling without a separate aggregation step. But in a dynamic pipeline where you don't know the call graph until runtime, that assumption breaks. The fire-and-forget tradeoff makes sense. I went with blocking enforcement because the original use case was preventing runaway agents, not auditing after the fact. For RAG you're probably right that eventual consistency is the better fit — you care more about the trace than cutting off a half-finished response.
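Roughly, the declared-upfront hierarchy looks like the sketch below. This is an illustration of the idea only, not veronica-core's actual API. `Budget`, `BudgetExceeded`, and `charge` are hypothetical names, and costs are integers in millionths of a dollar.

```python
from typing import Optional


class BudgetExceeded(RuntimeError):
    """Raised when a charge would push any budget in the chain over its ceiling."""


class Budget:
    def __init__(self, name: str, ceiling_micros: int, parent: Optional["Budget"] = None):
        self.name = name
        self.ceiling_micros = ceiling_micros
        self.parent = parent  # declared upfront; this is the assumption that breaks for dynamic graphs
        self.spent_micros = 0

    def charge(self, amount_micros: int) -> None:
        """Blocking enforcement: reject the charge before any spend is recorded."""
        # First pass: check this budget and every ancestor.
        node: Optional[Budget] = self
        while node is not None:
            if node.spent_micros + amount_micros > node.ceiling_micros:
                raise BudgetExceeded(f"{node.name} would exceed its ceiling")
            node = node.parent
        # Second pass: the spend rolls into every ancestor directly,
        # so there is no separate aggregation step and nothing to deduplicate.
        node = self
        while node is not None:
            node.spent_micros += amount_micros
            node = node.parent


a = Budget("A", ceiling_micros=100)
b = Budget("B", ceiling_micros=80, parent=a)
b.charge(60)  # counts against both B and A
a.charge(30)  # A's own spend; A is now at 90
try:
    b.charge(20)  # fits under B's ceiling but not A's
except BudgetExceeded as exc:
    print(exc)
```

Because each charge is recorded exactly once and propagated upward, A never sees B's calls reported a second time. The tradeoff is that B has to know who its parent is before it makes its first call.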