Hacker News new | ask | show | jobs
by jasondigitized 5 hours ago
A single 8h task? I'm sorry, but that's just asking for trouble.
5 comments

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.
I had good experiences doing multi-hour refactoring/housekeeping tasks that basically consisted of applying the same steps and rules n times.

Worth noting, a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals. It’s not the agent sputtering out code for eight hours straight.

And naturally I spend more time on manual verification in the end as much less of it is happening during the coding process.

> that basically consisted of applying the same steps and rules n times.

Why use a non-deterministic, possibly hallucinatory, definitely expensive, LLM when it sounds like a codemod is the perfect solution for this?

The money people spend on things I could probably do with an emacs macro...
Different people just have different concepts of what's garbage and what's not.

There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima-facie.

For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple years ago and now half their PRs is unused dead code.

You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.
Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/
I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours"

It fails all the time - as in it ends up doing something I want to change.

But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week it might be a day of human work, but it's still 5 times better than doing it by hand.

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.
That’s even smaller then!
This sounds like classic "you're using it wrong", if they had said it was done in smaller tasks you would very likely have people here saying that was wrong too.
My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.

if there're some specific tests/evals to satisfy that an agent can test by itself, it can easily iterate for hours. And this time also includes running those tests/evals, which may not be small.