| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by gamegoblin 166 days ago
	I routinely leave codex running for a few hours overnight to debug stuff If you have a deterministic unit test that can reproduce the bug through your app front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc — it's an ideal usecase

3 comments

nikkwong 166 days ago

I have a hard time understanding how that would work — for me, I typically interface with coding agents through cursor. The flow is like this: ask it something -> it works for a min or two -> I have to verify and fix by asking it again; etc. until we're at a happy place with the code. How do you get it to stop from going down a bad path and never pulling itself out of it?

The important role for me, as a SWE, in the process, is verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more like with your usecase—you can say "here's a failing test—do whatever you can to fix it and don't stop until you do". I could see that limited case working.

link

woah 166 days ago

For some reason setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher quality work for larger or more difficult tasks than the chat interface. It's like the agent doesn't have to spend half its time trying to guess what you want

link

vel0city 166 days ago

You do things like ralph loops.

https://github.com/snarktank/ralph

Its constantly restarting itself, looking at the current state of things, re-reading what was the request, what it did and failed at in the past (at a higher level), and trying again and again.

link

gamegoblin 166 days ago

I use Codex CLI or Claude Code

I don't even necessarily ask it to fix the bug — just identify the bug

Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.

link

p1esk 166 days ago

“here's a failing test—do whatever you can to fix it”

Bad idea. It can modify the code that the test passes but everything else is now broken.

link

SatvikBeri 166 days ago

I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)

link

zem 166 days ago

it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousands nailed down.

link

tsss 166 days ago

How can you afford that?

link

wahnfrieden 166 days ago

It costs $200 for a month

link

addaon 166 days ago

> it's an ideal usecase

This is impressive, you’ve completely mitigated the risk of learning or understanding.

link

arcanemachiner 166 days ago

Or, they have freed up time for more useful endeavours, that may otherwise have spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.

link