| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by nikkwong 124 days ago
	> Our latest frontier models have shown particular strengths in their ability to do long-running tasks, working autonomously for hours, days or weeks without intervention. I have yet to see this (produce anything actually useful).

8 comments

simonw 124 days ago

How hard have you tried?

I've been finding that the Opus 4.5/4.6 and GPT-5.2/5.3 models really have represented a step-change in how good they are at running long tasks.

I can one-shot prompt all sorts of useful coding challenges now that previously I would have expected to need multiple follow-ups to fix mistakes the agents made.

I got all of this from a single prompt, for example: https://github.com/simonw/research/tree/main/cysqlite-wasm-w... - including this demo page: https://simonw.github.io/research/cysqlite-wasm-wheel/demo.h... - using this single prompt: https://github.com/simonw/research/pull/79

link

aeyes 124 days ago

What do you mean? The generated script just downloads the sources and runs pyodide: https://github.com/simonw/research/blob/main/cysqlite-wasm-w...

There is maybe 5 relevant lines in the script and nothing complex at all that would require to run for days.

link

simonw 124 days ago

No, not for days - but it churned away on that one for about ten minutes.

I don't think I've got any examples of multi-hour or multi-day sessions that ran completely uninterrupted - this one back in December took 4.5 hours but I had to prompt it to keep going a few times along the way: https://simonwillison.net/2025/Dec/15/porting-justhtml/

link

AntiRush 123 days ago

This was a 24 hour task from a single prompt, GPT-5.2

https://tomisin.space/projects/graph-easy-ts/

link

andai 124 days ago

Maybe so, but I did once spend 12 hours straight debugging an Emscripten C++ compiler bug! (After spending the first day of the jam setting up Emscripten, and the second day getting Raylib to compile in it. Had like an hour left to make the actual game, hahah.)

I am a bit thick with such things, but just wanted to provide the context that Emscripten can be a fickle beast :)

I sure am glad I can now deploy Infinite Mechanized Autistic Persistence to such soul-crushing tasks, and go make a sandwich or something.

(The bug turned out to be that if I included a boolean in a class member, the whole game crashed, but only the Emscripten version. Sad. Ended up switching back to JS, which you basically need anyway for most serious web game dev.)

link

citizenpaul 121 days ago

How do you deal with the cost associated with a long running opus session? I asked it to validate some JSON configs against the spec yesterday and it burned $10 worth of tokens for what would have been a 1 millisecond linter task.

link

simonw 121 days ago

I'm on the $200/month Claude Max plan and I rarely run out of my token allowance.

I'm also paying $20/month for OpenAI Codex and again it's rare I hit the rate limits there.

link

basilgohar 124 days ago

Can you share any examples of these one-shot prompts? I've not gotten to the point where I can get those kind of results yet.

link

simonw 124 days ago

If you look through the commit logs on simonw/research and simonw/tools on GitHub most commits should either list the prompt, link to a PR with the prompt or link to a session transcript.

link

gamegoblin 124 days ago

I routinely leave codex running for a few hours overnight to debug stuff

If you have a deterministic unit test that can reproduce the bug through your app front door, but you have no idea how the bug is actually happening, having a coding agent just grind through the slog of sticking debug prints everywhere, testing hypotheses, etc — it's an ideal usecase

link

nikkwong 124 days ago

I have a hard time understanding how that would work — for me, I typically interface with coding agents through cursor. The flow is like this: ask it something -> it works for a min or two -> I have to verify and fix by asking it again; etc. until we're at a happy place with the code. How do you get it to stop from going down a bad path and never pulling itself out of it?

The important role for me, as a SWE, in the process, is verify that the code does what we actually want it to do. If you remove yourself from the process by letting it run on its own overnight, how does it know it's doing what you actually want it to do?

Or is it more like with your usecase—you can say "here's a failing test—do whatever you can to fix it and don't stop until you do". I could see that limited case working.

link

woah 124 days ago

For some reason setting up agents in a loop with a solid prompt and new context each iteration seems to result in higher quality work for larger or more difficult tasks than the chat interface. It's like the agent doesn't have to spend half its time trying to guess what you want

link

vel0city 124 days ago

You do things like ralph loops.

https://github.com/snarktank/ralph

Its constantly restarting itself, looking at the current state of things, re-reading what was the request, what it did and failed at in the past (at a higher level), and trying again and again.

link

gamegoblin 123 days ago

I use Codex CLI or Claude Code

I don't even necessarily ask it to fix the bug — just identify the bug

Like if I've made a change that is causing some unit test to fail, it can just run off and figure out where I made an off-by-one error or whatever in my change.

link

p1esk 124 days ago

“here's a failing test—do whatever you can to fix it”

Bad idea. It can modify the code that the test passes but everything else is now broken.

link

SatvikBeri 123 days ago

I've heard this said a lot but never had this problem. Claude has been decent at debugging tests since 4.0 in my experience (and much better since 4.5)

link

zem 124 days ago

it's more like "this function is crashing with an inconsistent file format error. can you figure out how a file with the wrong format got this far into the pipeline?". in cases like that the fix is usually pretty easy once you have the one code path out of several thousands nailed down.

link

tsss 124 days ago

How can you afford that?

link

wahnfrieden 124 days ago

It costs $200 for a month

link

addaon 124 days ago

> it's an ideal usecase

This is impressive, you’ve completely mitigated the risk of learning or understanding.

link

arcanemachiner 124 days ago

Or, they have freed up time for more useful endeavours, that may otherwise have spent on drudgery.

I don't discount the value of blood, sweat and tears spent on debugging those hard issues, and the lessons learned from doing so, but there is a certain point where it's OK to take a pass and just let the robots figure it out.

link

wahnfrieden 124 days ago

It worked for me several times.

It's easy to say that these increasingly popular tools are only able to produce useless junk. You haven't tried, or you haven't "closed the loop" so that the agent can evaluate its own progress toward acceptance criteria, or you are monitoring incompetent feeds of other users.

link

nikkwong 124 days ago

I'm definitely bullish on LLM's for coding. It sounds to me as though getting it to run on its own for hours and produce something usable requires more careful thought and setup than just throwing a prompt at it and wishing for the best—but I haven't seen many examples in the wild yet

link

foobar10000 124 days ago

It needs a closed loop.

Strategy -> [ Plan -> [Execute -> FastVerify -> SlowVerify] -> Benchmark -> Learn lessons] -> back to strategy for next big step.

Claude teams and a Ralph wiggum loop can do it - or really any reasonable agent. But usually it all falls apart on either brittle Verify or Benchmark steps. What is important is to learn positive lessons into a store that survives git resets, machine blowups, etc… Any telegram bot channel will do :)

The entire setup is usually a pain to set up - docker for verification, docker for benchmark, etc… Ability to run the thing quickly, ability for the loop itself to add things , ability to do this in worktree simultaneously for faster exploration - and got help you if you need hardware to do this - for example, such a loop is used to tune and custom-fuse CUDA kernels - which means a model evaluator, big box, etc….

link

wahnfrieden 124 days ago

I do it easily just by asking Codex

link

rcarmo 124 days ago

well, you can start with https://github.com/rcarmo/go-textile, https://github.com/rcarmo/go-rdp, https://github.com/rcarmo/go-ooxml, https://github.com/rcarmo/go-busybox (still WIP). All of these are essentially SPEC and test-driven and they are all working for me (save a couple of bugs in go-rdp I need to fix myself, and some gaps in the ECMA specs for go-ooxml that require me to provide actual manually created documents for further testing).

I am currently porting pyte to Go through a similar approach (feeding the LLM with a core SPEC and two VT100/VT220 test suites). It's chugging along quite nicely.

link

XCSme 124 days ago

Their ability to burn through tokens non-stop for hours, days or weeks without intervention.

link

raw_anon_1111 124 days ago

You’re mixing up Open AI for Anthropic.

Anthropic is actually sort of concerned with not burning through cash and charging people a reasonable price. Open AI doesn’t care. I can use Codex CLI all day and not approach any quotas with just my $20 a month ChatGPT subscription.

I treat coding agents like junior developers and never take my hand off the wheel except for boilerplate refactoring.

link

TheMuenster 124 days ago

Can I just say how funny this metric is?

"Our model is so slow and our tokens/second is so low that these tasks can take hours!" is not the advertising they think it is.

link

johnfn 124 days ago

The other day I got Codex to one-shot an upgrade to Vite 8 at my day job (a real website with revenue). It worked in this for over 3 hours without intervention (I went to sleep). This is now in production.

link

seunosewa 124 days ago

How did you verify it?

link

girvo 124 days ago

Just send it bro

(but honestly for a lot of websites and web apps you really can just send it, the stakes are very low for a lot of what most people do, if they're honest with themselves)

link

johnfn 123 days ago

Uhhh, this is my work, so… we didn’t have a SEV? None of our thousands of customers paying us money reported the site was broken?

link

ghosty141 123 days ago

I find this absolutely wild. From my experience Codex code quality is still not as good as a human so letting codex do smth and not verifying / cleaning up behind it will most likely result in lower code quality and possibly subtle bugs.

link

tinodb 122 days ago

For upgrading frameworks and such there are usually not that many architectural decisions to be made, where you care about how exactly something is implemented. Here the OP could probably verify the build works, with all the expected artifacts quite easily.

link

mikojan 123 days ago

Agreed. Optimistically let it resolve merge conflicts in an old complex branch. Looked fine at first but was utter slop upon further review. Duplication, wildly unnecessary complexity and all.

link

bitwize 124 days ago

PEBKAC

link