Hacker News
by _zoltan_ 24 days ago
"In this post, I’ll cover a third, not-so-obvious approach: building ways for the agent to validate more of its own work before a human has to step in."

this has been an obvious thing to do since at least January (since Geoffrey Huntley published "everything is a ralph loop"), and this is how I've been working: build enough orchestration tooling to be able to automate everything: development container bringup, building it, running the unit tests, doing integration testing, and eventually using the software as an end user would. then, to iterate, set performance goals on an already solid basis so the automated agent ("gym") can go and iterate autonomously, and let you know when it's "done".
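A rough sketch of that loop in Python, purely as an illustration. Everything here is hypothetical: `Stage`, `validate`, `iterate_until_goal`, and the callables passed in stand in for whatever your own tooling does (container bringup, build, tests, handing control to the agent). It assumes the performance metric is "higher is better".

```python
from typing import Callable, Optional, Sequence

# A stage is any callable that returns True when it passes
# (hypothetical stand-ins: container bringup, build, unit tests, ...).
Stage = Callable[[], bool]


def validate(stages: Sequence[Stage]) -> bool:
    """Run stages in order; all() stops at the first stage that fails."""
    return all(stage() for stage in stages)


def iterate_until_goal(
    propose_change: Callable[[], None],
    stages: Sequence[Stage],
    measure: Callable[[], float],
    goal: float,
    max_rounds: int = 10,
) -> Optional[int]:
    """Ask for a change, validate it, and stop once the metric meets the goal.

    Returns the round in which the goal was met, or None if max_rounds ran
    out. `propose_change` is a placeholder for whatever drives the agent.
    """
    for round_no in range(1, max_rounds + 1):
        propose_change()
        # The metric only counts when every validation stage passed.
        if validate(stages) and measure() >= goal:
            return round_no
    return None
```

The point of the `max_rounds` cap is that "done" has two meanings: the goal was met, or the budget ran out, and the caller can tell which one happened.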

I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.

8 comments

You can get really far with the 20x Claude Code and Codex plans. They are many orders of magnitude cheaper than API calls.
Enjoy that until the token economy comes crashing down on you.
Anthropic is profitable. When will people stop pretending like AI has not found real applications where it creates value?

It's here to stay, and IMO once VLA-driven robots enter the real world there will be enough money to pay for the datacenters. This coding stuff is great but there are only so many engineers to sell to.

Satya Nadella once said (more or less) "if AI is so good, why doesn't it show up in GDP?"

That's gonna be the step where it shows up in the GDP. Being able to train a machine to solve any problem that can be phrased in tokens (i.e., most of them) is going to remake society.

Agreed! Until they fit. :-)
This is where most of my productivity gains have come from. I have a special harness that I now move from project to project to do my testing orchestration. Lots of my work day is setting up a prompt or two early and just letting them loop until they return evidence that the feature is working, having gone through the big QA loop.

I've slowly been optimizing for token use through the stack and Claude ends up making very tight for loops for most of the process and keeping token count even lower. It's been nice. A lot of my toil at work is just gone.

I can see how you could avoid regressions this way, but what do you add to your harness to prove that a new feature is working?
I have it record a series of gifs or videos that I look over. If something looks off I'll dig into it, but I break down work into very very small chunks that are usually easily verifiable or don't require multiple steps.

Another thing I have in the general SDLC process is having it add enough logging to verify that features are turned on and configured as we expected. That becomes enough feedback for most of my features.
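As an illustration of the logging check, here is a minimal sketch that compares expected feature states against log lines. The log format (`feature=<name> enabled=<true|false>`) and the function names are assumptions made up for this example, not anything from a real harness.

```python
import re
from typing import Dict, Iterable

# Hypothetical log format, e.g. "INFO feature=new_checkout enabled=true".
FEATURE_LINE = re.compile(r"feature=(?P<name>\w+)\s+enabled=(?P<state>true|false)")


def feature_states(log_lines: Iterable[str]) -> Dict[str, bool]:
    """Return the last logged enabled/disabled state for each feature."""
    states: Dict[str, bool] = {}
    for line in log_lines:
        match = FEATURE_LINE.search(line)
        if match:
            states[match.group("name")] = match.group("state") == "true"
    return states


def mismatches(expected: Dict[str, bool], log_lines: Iterable[str]) -> Dict[str, str]:
    """Compare expected states to the log; return one reason per mismatch."""
    seen = feature_states(log_lines)
    problems: Dict[str, str] = {}
    for name, want in expected.items():
        if name not in seen:
            problems[name] = "never logged"
        elif seen[name] != want:
            problems[name] = f"expected enabled={want}, logged enabled={seen[name]}"
    return problems
```

An empty result from `mismatches` is the "evidence" the agent can hand back; a non-empty one tells it what to go and fix.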

I've been mostly focusing on being able to replicate this across stacks (more than 3 projects so far), with the eventual goal of having an agent be able to orchestrate our complete infra stack, and this being a large component of a DR plan to rebuild.

None of this is really new for us. I'm just the most knowledgeable in my group about how the different products across teams glue together, so I've been creating these Rube Goldberg machines as a prototype, and then having it iterate on codifying the parts that don't need a constant LLM. We were blessed to have an engineer a decade ago build out tooling for local container automation that matches 95% of the deployed infra stack. That last 5% sucks when you fall into it, but that's always been a truth. I've added to and expanded the tool over the years, making it act more like the deployed environment networking-wise, but a lot of things don't end up working well in Docker containers on M-series Macs when most of our complicated virtualization in our private cloud can't run on them yet...

I’ve been building this out too, and your comment made me realize the missing piece for me. I’ve given the agents tools to validate their own work, but I haven’t improved the experience of humans verifying the agents’ work.
For video/image stuff I found the ability for the LLMs to use ffmpeg and imagemagick to be quite fun.
for us it's (usually) very easy as I work on performance optimization. a non-negligible part of this is correctness and verifiability, so we already have some of that.

to give you an example: just recently I've coded a feature that, for our shuffle operation, can report which channel the bytes flowed through (as the PR giving us the plumbing underneath has landed upstream recently). what this basically means is that you run the shuffle, you know you've shuffled X bytes (because you have stats on both ends) and then you need to attribute them to different layers. on the first iteration, the count was off. the agent went, debugged, fixed, iterated, and then it was 1.5% off. again, it went, iterated, ... and now we're fine.

part of the task description was that the breakdown must match the known amount of bytes we're shuffling, so the agent took this on as a self-verification point. so besides running our normal, boring unit tests, integration tests and end-to-end verification harnesses (which it not only has programmatic/CLI/API access to, but which are also documented in .md files for projects), it could use this criterion on top to verify.
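The self-verification point described above boils down to a conservation check: the per-channel counts must add up to the known total. A minimal sketch, assuming the stats are available as a plain total and a per-channel mapping (the function names and channel names are made up for illustration):

```python
from typing import Dict


def attribution_error(total_bytes: int, bytes_by_channel: Dict[str, int]) -> float:
    """Relative gap between the known total and the sum of per-channel counts."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    attributed = sum(bytes_by_channel.values())
    return abs(total_bytes - attributed) / total_bytes


def breakdown_matches(
    total_bytes: int,
    bytes_by_channel: Dict[str, int],
    tolerance: float = 0.0,
) -> bool:
    """True when the breakdown accounts for the total within `tolerance`."""
    return attribution_error(total_bytes, bytes_by_channel) <= tolerance
```

With a total of 1000 bytes and a breakdown of 600 + 385, `attribution_error` returns 0.015, which is the kind of "1.5% off" result that sends the agent back for another iteration.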

looking at /usage, my API duration was 2h 43m, and on top of that:

      claude-haiku-4-5:  2.7k input, 115.3k output, 16.3m cache read, 867.9k cache write ($3.30)
       claude-opus-4-8:  46.9k input, 555.0k output, 166.6m cache read, 2.9m cache write ($115.77)
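For reference, the two dollar figures above add up as follows. This only sums the numbers as quoted; it assumes nothing about how they were billed.

```python
from decimal import Decimal

# Costs copied from the two /usage lines above, in US dollars.
costs = {
    "claude-haiku-4-5": Decimal("3.30"),
    "claude-opus-4-8": Decimal("115.77"),
}

total = sum(costs.values(), Decimal("0"))
opus_share = costs["claude-opus-4-8"] / total

print(f"total: ${total}")                # total: $119.07
print(f"opus share: {opus_share:.1%}")   # opus share: 97.2%
```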
Definitely agree that performance optimization is a good use case for LLMs. Here you have both a measurable goal / objective function and guardrails against functional regressions. It kind of closes the loop in that regard.

One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature. So you could still get code degradation.

> One thing however is a test suite is not usually exhaustive in the sense that any code that passes the tests is valid. Usually tests are more complementary in nature.

Not in the world of AI - if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough. There's no excuse at this point not to have an incredibly comprehensive test suite, to go with your other agent feedback loop constraints.

>> if your tests don't catch any known issues, the problem is the tests aren't comprehensive enough.

Maybe I misunderstand, but this seems like a fairly low bar if the test suite only covers existing bugs.

I'd argue that if you aren't going to look at the code you actually need a fully comprehensive test suite - in the sense that if the tests pass, the code is correct and you don't have to look at it at all. The problem is that such a suite isn't very quick to create, it seems. Of course, if there is a way to do it quickly in a way that is reproducible by others I'd love to hear about it.

The rest of that paragraph's explanation is more important.

"The goal to make longer unattended sessions safe enough to be useful without fully removing the human from the loop. It should also reduce the number of low-quality PRs your teammates have to review for details the agent should have caught itself."

>safe enough to be useful without fully removing the human from the loop

This is the fundamental concept for AI usage, assistance and adoption in every field, not only code generation.

Essentially AI, including LLMs, ML and DL, is just a tool like any other automation tool, operating on the principle of expert-in-the-loop as the safety and quality gatekeeper for sensible and responsible decision making [1].

[1] Domain expertise has always been the real moat (brethorsting.com) (519 comments):

https://news.ycombinator.com/item?id=48340411

it's fine to remove the human from the loop. set a macro goal, tell the agent how you think it could go there, and let it go nuts.

with enough scaffolding around self-reflectivity and metrics, it will converge.

> I understand this probably does not work if you're on some subscription and not using the API (tokens burn fast), but this has been extremely productive for me.

Do you have infinite money?

compared to salaries, our bill isn't outrageous.
It is too bad that so many companies have built these huge microservices beasts where you cannot run much more than unit tests locally, and you pretty much have to guess at the impact of your change until you have merged it, deployed it, and turned on the feature flag for your own account / test account. This loop is slow, which is perhaps a cost they are willing to pay, but it is also not safe for agents, which will be a massive setback.
What’s the cost of all that though? I don’t doubt that productivity could be gained but when I see articles like the one on the Open Claw guy spending 1.3 million on tokens in a single month I am reminded of drag racing engines that can reach incredible speeds but also need to be completely rebuilt after a single race.
Depends on the quality of your validation loop. Can the agent find the bug in a five second unit test, or does it have to run the full deployment test?

It also presents tradeoffs in compute budget. Cycles spent executing large arrays of tests could mean fewer tokens spent debugging.
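One way to picture the tradeoff is a validation loop that runs the cheapest checks first, so a bug caught by a five-second unit test never pays for the full deployment test. This is only a sketch: `Check`, `first_failure`, and the runtime estimates are hypothetical, and it assumes you have a rough idea of how long each check takes.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Check:
    name: str
    est_seconds: float          # rough expected runtime (assumed known)
    run: Callable[[], bool]     # returns True when the check passes


def first_failure(checks: List[Check]) -> Tuple[Optional[str], float]:
    """Run checks cheapest-first and stop at the first failure.

    Returns (name of the failing check or None, estimated seconds spent,
    counting the failing check itself).
    """
    spent = 0.0
    for check in sorted(checks, key=lambda c: c.est_seconds):
        spent += check.est_seconds
        if not check.run():
            return check.name, spent
    return None, spent
```

The second element of the result is a crude proxy for what the loop cost; comparing it across runs shows whether bugs are being caught in the cheap tier or only in the expensive one.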

> Depends on the quality of your validation loop. Can the agent find the bug in a five second unit test, or does it have to run the full deployment test?

I am not asking about time or completeness. I am asking if this person is spending 1 dollar to make more than a dollar, or if they are spending 1 dollar to make less than a dollar.

No other criteria need to be considered if the activity is not profitable.

Who cares about that?

Vibes, baby, vibes!!!

Based on the reply from the person I was talking to, it does seem that way.
I don't know if you guys are trolling or not, honestly. Vibe coding is misrepresented. I'm building extremely complex features that are grounded in our codebase but fundamentally rest on the training data, which includes all academic papers on the very subject I am iterating on.

you can't just handwave this all away with "vibe baby, vibe" and then high-five each other that oh you're so clever because you manually write code/think the cost is too high.

I can't give exact numbers but I find the cost more than OK.
yep, this has been obvious to a lot of people for a while. especially after Cherny posted about exactly this in a massively popular thread... four months ago: https://x.com/bcherny/status/2007179861115511237
What license do you use then?
you can just pay by volume ("API pricing")