| HN Mirror



	by jasondigitized 51 days ago
	A single 8h task? I'm sorry, but that's just asking for trouble.

5 comments

queuebert 51 days ago

I don't understand how some of y'all use these things. I get garbage unless I give them very specific concrete tasks with as much context as possible. Anything that takes more than 30 min is usually a waste because the scope was too large.


whstl 51 days ago

Different people just have different concepts of what's garbage and what's not.

There seems to be some kind of AI hysteria going on, with people becoming so enamoured with the AI that they accept anything it produces as if it's some gift from the gods, while others just reject it prima facie.

For example, the worst design I have seen recently was from a designer who pivoted into "vibe coding influencer". The worst code is from developers who were heavily into Clean Code a couple years ago and now half of their PRs are unused dead code.


gessha 50 days ago

“One man’s trash is another man’s treasure” takes on a new meaning in today’s agentic coding world.


smoe 51 days ago

I had good experiences doing multi-hour refactoring/housekeeping tasks that basically consisted of applying the same steps and rules n times.

Worth noting, a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals. It’s not the agent sputtering out code for eight hours straight.

And naturally I spend more time on manual verification in the end as much less of it is happening during the coding process.


culi 51 days ago

> that basically consisted of applying the same steps and rules n times.

Why use a non-deterministic, possibly hallucinatory, definitely expensive, LLM when it sounds like a codemod is the perfect solution for this?


smoe 51 days ago

In this case, handling all the edge cases and variants, and testing a codemod, would have taken significantly more of my time, which costs quite a bit more than the LLM.

Obviously, a deterministic tool is preferable in general, but it is not always worth bothering with for a one off task.


mashlol 51 days ago

I usually make the llms do that part for me. Instead of asking the llm to refactor, ask it to write the codemod script that'll refactor, have it test that script, and even have it run it on its own. It's definitely faster and less error prone that way for me.
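
To make the codemod idea concrete, here is a minimal sketch of what such a script might look like for a Python codebase. It assumes the refactor is "rename every call to one function"; the names `fetch` and `load` are hypothetical, and `ast.unparse` (Python 3.9+) drops comments and original formatting, so a real codemod would likely want a formatting-preserving tool instead.

```python
import ast


class RenameCall(ast.NodeTransformer):
    """Rename direct calls to `old` so they call `new` instead."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.count = 0  # number of call sites rewritten

    def visit_Call(self, node):
        # Visit children first so nested calls are rewritten too.
        self.generic_visit(node)
        if isinstance(node.func, ast.Name) and node.func.id == self.old:
            node.func.id = self.new
            self.count += 1
        return node


def apply_codemod(source, old, new):
    """Return (rewritten_source, number_of_renamed_calls)."""
    tree = ast.parse(source)
    transformer = RenameCall(old, new)
    new_tree = transformer.visit(tree)
    return ast.unparse(new_tree), transformer.count


if __name__ == "__main__":
    # Hypothetical input: only the two call sites should change,
    # not the assignment target or the string literal.
    example = "x = fetch(1)\ny = other(fetch(2))\nfetch = 3\nprint('fetch')\n"
    rewritten, changed = apply_codemod(example, "fetch", "load")
    print(rewritten)
    print(changed, "call sites renamed")
```

Because the transformation is deterministic, it can be tested once on a small sample and then run over the whole codebase, which is the property the deterministic-tool argument in this subthread relies on.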


culi 50 days ago

In that case, your original description of "basically consisted of applying the same steps and rules n times" was misleading.


beepbooptheory 51 days ago

The money people spend on things I could probably do with an emacs macro...


eru 51 days ago

Your time to create that macro ain't free.


ardacinar 50 days ago

Neither is your time writing that prompt. When people talk about elaborate prompts with a lot of detailed instructions, guardrails, etc., I'm kind of assuming that takes time.


jon_adler 50 days ago

How about coding an emacs macro with your agent?


queuebert 50 days ago

> ... applying the same steps and rules n times

I do this too, with a document written for this purpose.

> ... a significant chunk of those runs involved the agent waiting for the compiler, linters, type checks, and test suites, as well as updating journals.

That is a good point. I'm mostly using C, which seemingly compiles in O(1) time, so I could imagine a large C++ or Rust codebase taking much longer to iterate simply due to compilation times.


okamiueru 50 days ago

What do you mean by C compiling in O(1)? Is that what the LLM told you?


queuebert 50 days ago

It's a joke about how fast it compiles. whoosh


sunir 50 days ago

Clear winner's circle. Clear objective. Clear scope.

Clear evaluation function for an objective metric if they are making progress or regressing.

Evaluation function is computed, not llmed.

Ontology of potential actions clearly specified.

Accurate inventory of the current status quo.

Clear enumeration of options from status quo towards the winner's circle.

Waypoint objectives with similarly concrete evaluations of pass/fail, or on target off target.

It's the same thing when leading a large organization to actually hit a goal. There's randomness every turn away from your mind, so the more constrained the options, the more likely you are to hit the target. The consequence is if you're wrong about the plan then with people you're fucked. Morale will plummet. With AIs, they are so nerfed emotionally now, you clear context and start again.

I did enjoy Sonnet 4 when they would swear randomly and become sullen or wax desperately. That would at least cause pushback against a bad plan.


j16sdiz 50 days ago

Fable promised to be better at long-running tasks.

The parent post has a goal of "..see how it will perform.."

There is nothing wrong with experimenting with something new.


viccis 50 days ago

This is my fucking life at work right now. I look forward to the weekends. I've never been truly inconvenienced by shitty devs because they're often too lazy to really spam me with bad code, but now they are all free to do so. I spent so much time today writing guardrail markdown files when these people SHOULD HAVE BEEN ABLE TO REVIEW THE OUTPUT AND KNOW THAT IT WAS BAD.

It truly is the age of the 90 IQ software engineer. They've never had it better.


duskdozer 50 days ago

As if meetings weren't bad enough already, I now have to sit through an informal introduction to the model of the week and its personality characteristics and how quickly it burnt through one subscription's token allotment or whatever and the latest tweaks on the magic markdown files. Luckily I've only had a couple changes sent my way so far, which weren't much different than just getting a bug report to debug and fix myself. I will need to get into risky options gambling or something so I can go start my farm early, if it keeps going this way. Even supposing it all works correctly, I don't see how it is in any way enjoyable, satisfying, or fulfilling.


standardUser 51 days ago

You have to build up a context, or otherwise seed the memory, to get anything useful out of these LLMs on a large or existing project.


CuriouslyC 50 days ago

If you're giving it 8 hours of stuff to create with a template (e.g. slop forking) that's not a big deal. Letting it run for 8 hours to debug a weird failure also tends to work out.


maxall4 51 days ago

Indeed, according to METR, Mythos only achieved an 80% success rate with 3 hour tasks. https://metr.org/time-horizons/


nl 51 days ago

I use both Opus and Fable on tasks that are well beyond "things that would take a human 3 hours".

It fails all the time - as in it ends up doing something I want to change.

But this doesn't actually matter - if it takes 3 or 4 iterations on something that would have taken me a week it might be a day of human work, but it's still 5 times better than doing it by hand.


mordymoop 50 days ago

This seems like the obvious correct frame of mind with which to approach these tools. If it works for three hours on a task that would have taken me three work weeks, and 20% of the time it gets the task wrong, then I can just ask it to do it again with adjusted instructions. It will be much more likely to get it right the second time, and I’m still ahead of where I would have been by 14 days and 2 hours.
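
The arithmetic behind this can be sketched as follows. It is a back-of-the-envelope model, not a measurement: it assumes a 5-day work week, an 8-hour work day, and independent attempts with a constant 20% failure rate (the comment actually expects the retry to do better than that). All function and constant names are made up for illustration.

```python
HOURS_PER_WORK_DAY = 8   # assumption
DAYS_PER_WORK_WEEK = 5   # assumption


def expected_attempts(p_fail):
    """Mean number of attempts until the first success (geometric distribution)."""
    if not 0 <= p_fail < 1:
        raise ValueError("p_fail must be in [0, 1)")
    return 1 / (1 - p_fail)


def success_within(p_fail, attempts):
    """Probability that at least one of `attempts` independent runs succeeds."""
    return 1 - p_fail ** attempts


def time_saved(manual_weeks, run_hours, attempts):
    """Return (work_days, hours) saved versus doing the task by hand."""
    manual_hours = manual_weeks * DAYS_PER_WORK_WEEK * HOURS_PER_WORK_DAY
    saved_hours = manual_hours - run_hours * attempts
    return divmod(saved_hours, HOURS_PER_WORK_DAY)


if __name__ == "__main__":
    print(expected_attempts(0.2))   # mean attempts at a 20% failure rate
    print(success_within(0.2, 2))   # chance that two attempts are enough
    print(time_saved(3, 3, 2))      # three work weeks vs. two 3-hour runs
```

Under those assumptions, two 3-hour runs against 120 hours of manual work leave 114 hours saved, which is the "14 days and 2 hours" figure when counted in 8-hour work days.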


baq 50 days ago

Or in two words, managing variance.

Play some holdem folks and keep track of how many times you lost with pocket aces.


jwood27 51 days ago

Those are tasks that would take a human 3 hours to complete, not tasks that the model works on for 3 hours.


jadar 51 days ago

That’s even smaller then!


notnullorvoid 51 days ago

This sounds like classic "you're using it wrong". If they had said it was done in smaller tasks, you would very likely have people here saying that was wrong too.


int_19h 51 days ago

My record for a single uninterrupted session (albeit with Codex, not Claude) is 80+ hours. It was very productive, too.

The trick is having large, extensive test suites and forcing the agent to run them regularly.


danmaz74 50 days ago

So I guess that a lot of those 80 hours were spent running the test suite between changes?


int_19h 44 days ago

Yep. I should add that the current crop of models is much more tolerant of something like this, compared to where we were a year ago - as in, they are quite willing to wait for a long time for the test or profiling run to finish without giving up on it, if the instructions make it clear that this is normal and expected.


danmaz74 38 days ago

I think that waiting or not is mostly up to the harness, not the model itself.


nujabe 45 days ago

An agent can’t have an “uninterrupted session” if you have to be “forcing” it to do stuff.


int_19h 44 days ago

"Forcing" here basically means giving initial instructions that clearly require passing the tests as a condition of finishing the work. The agent still works uninterrupted.
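
A rough sketch of that "tests as the exit condition" loop, for illustration only. This is not how Codex or any particular harness is implemented; `propose_change` is a hypothetical stand-in for one agent step, and `check` is any computed pass/fail function, such as running a test command and reading its exit code.

```python
import subprocess
import sys


def command_passes(cmd):
    """Computed check: True when the command exits with status 0."""
    return subprocess.run(cmd, capture_output=True).returncode == 0


def iterate_until_green(propose_change, check, max_rounds=10):
    """Apply changes until `check()` passes.

    Returns the number of changes that were needed, or None if the
    check still fails after `max_rounds` changes.
    """
    for changes_made in range(max_rounds + 1):
        if check():
            return changes_made
        if changes_made < max_rounds:
            # Hypothetical agent step: edit code, update notes, etc.
            propose_change(changes_made + 1)
    return None


if __name__ == "__main__":
    # Stand-in for a real test suite: a trivial command that always passes.
    always_green = [sys.executable, "-c", "raise SystemExit(0)"]
    rounds = iterate_until_green(
        propose_change=lambda n: None,
        check=lambda: command_passes(always_green),
    )
    print("changes needed:", rounds)
```

The point of the design is that the loop only ends on a computed result, so most of the wall-clock time of a long session can be the check itself rather than code generation.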


yalok 51 days ago

If there are specific tests/evals to satisfy that an agent can run by itself, it can easily iterate for hours. And this time also includes running those tests/evals, which may not be small.
