|
|
|
|
|
by das-bikash-dev
117 days ago
|
|
The multi-agent budget problem you're describing gets even harder when the
services are heterogeneous. In a RAG pipeline, a single user query might hit:
query analysis (LLM call), embedding generation (different model/pricing),
reranking (yet another model), and response generation (LLM call) — each
potentially in a different process. Per-call monkey-patching sees each call in isolation. What I ended up doing was
a trace-based approach: every request gets a trace ID, each service appends cost
spans asynchronously, and a separate enrichment step aggregates the total. The
hard part was deduplication — when service A reports an aggregate cost and
service B reports the individual calls that compose it, you need to reconcile
or you double-count. Your atomic disk writes for halt state is a nice pattern. I went with
fire-and-forget (never block the request path, accept eventual consistency on
cost data) but that means you can't do hard enforcement mid-request like
AgentBudget does. |
|