Ask HN: How are you keeping AI coding agents from burning money? | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

Ask HN: How are you keeping AI coding agents from burning money?

8 points by bhaviav100 122 days ago

My agents retry a bit more than it should, and there goes my bill up in the sky. I tried figuring out what is causing this but none of the tools helped much.

and the worse thing for me is that everything shows up as aggregate usage. Total tokens, total cost, maybe per model.

So I ended up hacking together a thin layer in front of OpenAI where every request is forced to carry some context (agent, task, user, team), and then just logging and calculating cost per call and putting some basic limits on top so you can actually block something if it starts going off the rails. It’s very barebones, but even just seeing “this agent + this task = this cost” was a big relief.

It uses your own OpenAI key, so it’s not doing anything magical on the execution side, just observing and enforcing.

I want to know you guys are dealing with this right now. Are you just watching aggregate usage and trusting it, or have you built something to break it down per agent / task?

If useful, here is the rough version I’m using : https://authority.bhaviavelayudhan.com/

14 comments

cableshaft 121 days ago

I have yet to go over the standard limits (by that I mean whatever is included with my normal Copilot and ChatGPT subscriptions), but I only started really using it the past couple of months.

At least right now I don't see myself going over the standard limits for my personal projects. I'm not implementing more than 2-3 features a day at most, so only a handful of prompts.

I have come close to reaching the limit at work this month, but I don't have to pay for the extra requests so I'm not super concerned. I also hit it pretty hard this month with a pretty gnarly upgrade that required a lot of back and forth, which is what accumulated a lot of that usage.

Not sure how you're using so much that you're getting big bills out of this.

bhaviav100 120 days ago

Especially when you are running multiple agents for research

DarthCeltic85 122 days ago

I had gotten a student/ultra code for antigravity promo for three months, so I was using that, but that finally ran out this month. Currently Im using windstream and flipping between claude as my left brain and code extraction and the higher context but cheaperish models there.

honestly though, im getting to a point where im running custom project mds that flip between different models for different things, using list outputs depending on what it finds and runs. (I have two monorepo projects, and one thats a polyglot microengine that jumps using gRPC communication.)

The mds are highly specialized for each project as each project deals with vastly different issues. Cycling through the different pro accounts and keeping the mds in place over it all is helping me not kill my wallet.

bhaviav100 122 days ago

hmm interesting model routing + specialized MDs makes sense for cost efficiency.

I’m seeing a different failure mode though that even with good routing, agents are looping or retrying and burning my money.

trcarney 120 days ago

I would suggest trying a different harness. I only use OpenCode and haven't run into limits on even the $20 OpenAI plan. I will say that I use it more as an assistant and don't usually just say "hey go do this thing" without a plan broken down into tasks. With that being said, i don't think i have used even 50% of my 5 hour allotment, even when having it do everything.

So all that to say, maybe try other harnesses because it could just be a prompt issue in the harness.

I am also an unabashed opencode shill, so take that into account too.

stockyarddev 119 days ago

The per-route breakdown is exactly what's missing from the native dashboards. I've been running Trough for this: it sits in front of your HTTP API calls and tracks cost by route, so you can see "this endpoint is costing $X/day" rather than just an aggregate total. Retry storms show up as a spike on a specific route, which makes it easy to pinpoint the loop. Self-hosted Go binary, free for one service. stockyard.dev/trough

bhaviav100 119 days ago

This is great visibility..just checked the website..I will try this over weekend

stockyarddev 117 days ago

Awesome! let me know how it goes. If you hit anything weird during setup, open an issue on GitHub (github.com/stockyard-dev/stockyard-trough) or email me directly: michael@stockyard.dev. The install is just `curl -fsSL stockyard.dev/trough/install.sh | sh` and it should be running in about 30 seconds

sminchev 121 days ago

Some things that I know how to do, I just run myself. If starting the tests is a bash command, I asked the AI to create bash script that does this, and then I run it myself. Same, with the build, deploy and other similar tasks. For some no so important tasks, I use different model, like GLM, which is cheaper. Then I save the result of the, let's say bug analysis, or code review, and ask my main model (Opus) to read the document and execute the task. This way I use my expensive model to write the tasks, but the cheaper one to do the analysis.

bhaviav100 121 days ago

I haven't tried this .will do

rox_kd 122 days ago

In what settings do you mean - there are multiple strategies, I think building your own compaction layer in front seems a bit over-kill ? have you considered implementing some cache strategy, otherwise summary pipelines - I made once an agent which based on the messages routed things to a smaller model for compaction / summaries to bring down the context, for the main agent.

But also ensuring you start new fresh context threads, instead of banging through a single one untill your whole feature is done .. working in small atomic incrementals works pretty good

bhaviav100 122 days ago

yes, compaction and smaller models help on cost per step.

But my issue wasn’t just inefficiency, it was agents retrying when they shouldn’t.

I needed visibility + limits per agent/task, and the ability to cut it off, not just optimize it.

rox_kd 122 days ago

I'm working on a fun project I call OpenFAST, which essentially tries to solve the context transitioning - but its still in early days and haven't released anything yet.

I think one of the bigger issues, is the o(n) orchestration to agent calls that often feels uncontrolled .. ending up making the orchestrator of sub-agents the main bottleneck due to the large context it sometimes ends up with.

I'm working on an idea where agents delivers briefs & deliveres as real artifacts, and then having each spawned sub-agent read briefs, and if they need further information pick up the delivery for that specific brief.

It helps drift detection across agents, and the best part is orchestrator only delegates jobs, but doesn't do much further than that.

Whenever sub-agents has delivered their tasks, orchestrator can then read a merged brief/delivery for that specific round.

So far it helps cutting that extra tool call where each sub-agent answers the orchestrator - but it also helps the orchestrator only dwelve into deliveries which it believes are relevant rather than trying to understand and comprehend every small detail.

I can share more when I'm a bit further maybe you could get some inspiration here.

bhaviav100 122 days ago

This is interesting and I would love to understand more on this..is there a GitHub which I can look at?

Here's something which would help you with another perspective on the contexts https://authority.bhaviavelayudhan.com/journal/35

rox_kd 121 days ago

I haven't published anything yet, its still an early PoC - Thanks will read this!

jerome_mc 122 days ago

AI outputs often feel like a gacha game. Paradoxically, the 'expensive' tokens are sometimes the cheapest in the long run. In my experience, higher-end models have a much higher 'one-shot' success rate. You aren't just saving on total token count by avoiding loops; you’re saving engineering time, which is always the most expensive resource anyway.

bhaviav100 122 days ago

Both yes and no .we don't have a way to predict or forecast this

paulwelty 122 days ago

Mine have burned a lot of money! Right now, I'm trying to keep the context smaller. It takes a lot of discipline, though, to have a system that gives enough context to do the work but not so much the agent can go off doing new/crazy stuff.

bhaviav100 122 days ago

If only there was a way to manage contexts better

grahammccain 122 days ago

Kinda of an adjacent question but do you think the token/usage way of paying for things will stick? I still think people would rather pay a monthly subscription for a seat.

bhaviav100 122 days ago

Companies won't survive with seats pricing

https://www.theoperatorscircle.com/journal/36

bisonbear 122 days ago

cost control is a policy problem - we certainly don't need to use opus 4.6 for a simple test refactor, but many people (including myself) default to it anyways. we need a way to measure cost / performance for agents on individual repos, with individual types of tasks, to get a better sense of what tasks can be trusted to cheaper agents, and what tasks must be routed to the SOTA

bhaviav100 122 days ago

Exactly why I built this.

But cost control is not an entirely policy problem. Policies are just guidelines.

sph 121 days ago

Interesting how 50% of the comments in this thread, flagged or not, are from green accounts.

Eternal September 2.0: the LLM edition.

mohit17mor 121 days ago

Not really sure if you mean setups like OpenClaw as well, but I ran into pretty similar issues there.

That was actually a big part of why I started building my own agent system, Arc agent, mostly just to see if I could solve some of the problems every agent faces. A few practical things helped, putting limits on iteration/retry loops so the agent can’t wander forever, being stricter about context/tool handling once external skills are in play and adding a simple cost/token counter so I could at least see where usage was going. I also tried to reduce the usage of llm whereever a simple code would work more reliably.

I tried to tackle few other problems like downtime during context compaction etc, but yeah the project is still work in progress and i am experimenting with diff stuff.

bhaviav100 120 days ago

Sounds exciting..I liked the token counter concept. Didn't thought about it though. Do you have a GitHub repo?

spl757 122 days ago

Don't use tech with deep, unresolved flaws and you won't get fucked.

Would you find it acceptable if Postgresql occassionally hallucinated and returned gibberish? Fuck no.

Wny is this okay with ANY software? Answer, it's not. AI IS NOT READY.

bhaviav100 122 days ago

The only way to make something better is to use it more

stephenr 121 days ago

It's wild that these companies have convinced you to pay to be a beta (at best; arguably much of this is pre alpha quality shit) tester and you're perfectly happy with that scenario.

spl757 122 days ago

By not using it. The tech is flawed. It hallucinates. It's not production ready. I've said it before, and I will say it again. Anyone using AI in a production environment is a fucking idiot.