Hacker News
by nightshift1 251 days ago
I think that letting an LLM run unsupervised on a task is a good way to waste time and tokens. You need to catch them before they stray too far off-path. I stopped using subagents in Claude because I wasn't able to see what they were doing and intervene. Indirectly asking an LLM to prompt another LLM to work on a long, multi-step task doesn't seem like a good idea to me. I think community efforts should go toward making LLMs more deterministic with the help of good old-fashioned software tooling instead of role-playing and writing prayers to the LLM god.
4 comments

When the task is bigger than I trust the agent to handle on its own, or bigger than I can comfortably review in one go, I ask it to create a plan with steps, then create an md file for each step. I review the steps and ask the agent to implement the first one. I review that one, fix it, then ask it to update the next steps and implement the next one. And so on, until finished.
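For anyone who wants to script the bookkeeping side of this, here is a rough sketch of the "one md file per step" idea. The file naming (`step-01.md`), the `Status:` line, and the helper names are all made up for illustration; no agent tool prescribes this layout.

```python
from pathlib import Path
from typing import List, Optional


def write_step_files(plan_dir: Path, steps: List[str]) -> List[Path]:
    """Write one markdown file per plan step (hypothetical layout)."""
    plan_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, description in enumerate(steps, start=1):
        path = plan_dir / f"step-{number:02d}.md"
        path.write_text(
            f"# Step {number}\n\nStatus: pending\n\n{description}\n",
            encoding="utf-8",
        )
        paths.append(path)
    return paths


def next_pending_step(plan_dir: Path) -> Optional[Path]:
    """Return the first step file still marked pending, or None."""
    for path in sorted(plan_dir.glob("step-*.md")):
        if "Status: pending" in path.read_text(encoding="utf-8"):
            return path
    return None


def mark_done(path: Path) -> None:
    """Flip a step to done after a human has reviewed and fixed it."""
    text = path.read_text(encoding="utf-8")
    path.write_text(text.replace("Status: pending", "Status: done", 1), encoding="utf-8")
```

The point is that the agent only ever gets handed `next_pending_step(...)`, and a step only moves forward when a person calls `mark_done` after reviewing it.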
Have you tried scoped context packages? Basically, for each task, I create a .md file that includes relevant file paths, the purpose of the task, key dependencies, a clear plan of action, and a test strategy. It’s like a mini local design doc. I found that it helps ground implementation and stabilizes the output of the agents.
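A minimal sketch of what generating such a file could look like, assuming a plain markdown layout with one section per item listed above. The `ContextPackage` class, its field names, and the section headings are hypothetical, not a standard format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ContextPackage:
    """Hypothetical per-task context file: a mini local design doc."""
    title: str
    purpose: str
    file_paths: List[str] = field(default_factory=list)
    dependencies: List[str] = field(default_factory=list)
    plan: List[str] = field(default_factory=list)
    test_strategy: str = ""

    def to_markdown(self) -> str:
        def bullets(items: List[str]) -> str:
            return "\n".join(f"- {item}" for item in items) or "- (none)"

        # Number the plan so the agent can be told "implement step 2 only".
        steps = "\n".join(f"{n}. {step}" for n, step in enumerate(self.plan, start=1))
        sections = [
            f"# Task: {self.title}",
            f"## Purpose\n{self.purpose}",
            f"## Relevant files\n{bullets(self.file_paths)}",
            f"## Key dependencies\n{bullets(self.dependencies)}",
            f"## Plan of action\n{steps or '1. (to be written)'}",
            f"## Test strategy\n{self.test_strategy or '(to be written)'}",
        ]
        return "\n\n".join(sections) + "\n"
```

You would write `to_markdown()` out to a file next to the code and point the agent at that file instead of re-explaining the task in chat.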
I read this suggestion a lot. “Make clear steps, a clear plan of action.” Which I get. But then instead of having an LLM flail away at it, could we give it to an actual developer? It seems like we’ve finally realized that clear specs make dev work much easier for LLMs. But the same is true for a human. The human will ask more clarifying questions and not hallucinate. The LLM will roll the dice and pick a path. Maybe we as devs would just rather talk with machines.
Yes, but the difference is that an LLM produces the result instantly, whereas a human might take hours or days.

So if you can get the spec right, and the LLM+agent harness is good enough, you can move much, much faster. It's not always true to the same degree, obviously.

Getting the spec right, and knowing what tasks to use it on -- that's the hard part that people are grappling with, in most contexts.

I'm using it to help me build what I want and learn how. It being incorrect and needing questioning isn't that bad, so long as you ARE questioning it. It has brought up so many concepts, parameters, etc. that would be difficult to find and learn alone. Documentation can often be very difficult to parse. LLMs make it easier.
> Maybe we as devs would just rather talk with machines.

This is kind of how I feel. Chat as an interaction is mentally taxing for me.

Separately, you have to consider that "wasting tokens spinning" might be acceptable if you're able to run hundreds of thousands of these things in parallel. If even a small subset of them translates into value, you come out far ahead of a strictly manual/human process.
> hundreds of thousands of these things in parallel

At what cost, monetary and environmental?

If the system provides value that is greater than its cost, then paying the cost to gain the value is always worthwhile - regardless of the magnitude of the cost.

As costs drop exponentially (a reasonable expectation for LLMs, etc.), increasing agent parallelism becomes more and more economically viable over time.
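To make the value-versus-cost argument concrete, here is a back-of-the-envelope sketch. It assumes runs are independent, every run costs the same, and every success is worth the same, which is a simplification. The function names and the numbers in the usage note are invented for illustration.

```python
def net_value(runs: int, success_rate: float,
              value_per_success: float, cost_per_run: float) -> float:
    """Expected net value of launching `runs` independent agent runs."""
    return runs * (success_rate * value_per_success - cost_per_run)


def break_even_success_rate(value_per_success: float, cost_per_run: float) -> float:
    """Smallest success rate at which parallel runs stop losing money."""
    return cost_per_run / value_per_success
```

With made-up numbers, `net_value(1000, 0.02, 100.0, 1.0)` is positive because a 2% success rate sits above the 1% break-even point given by `break_even_success_rate(100.0, 1.0)`. Any cost the accounting leaves out, environmental ones included, would have to be folded into `cost_per_run` for the comparison to mean anything.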

>As costs drop exponentially

Not a reasonable expectation anymore. Moore's Law has been dead for more than a decade and we're getting close to physical limits.

I do the same thing with my engineers but I keep the tasks in Jira and I label them "stories".

But in all seriousness +1 can recommend this method.

This is built into Cursor now with plan mode https://cursor.com/docs/agent/planning
How does Cursor plan mode differ from Claude Code plan mode? I've used the latter a lot (it's been there a long time), and the description seems very similar. The big difference with the workflow I described is that with that plan mode you don't get to review and correct what happened between steps.
I've not used Claude Code, so my answer might not be that useful. But I would think that because both are chat-based interfaces you would be able to instruct the model to either continue without approval or wait for your approval at each step. I certainly do that with Cursor. Cursor has also recently started automatically generating TODO lists in the background (with a tool call I'm assuming), and displaying them as part of the thinking process without explicit instruction. I find that useful.
This, plus a reset in between steps, usually helps focus context in my experience.
Yeah in my experience, LLMs are great but they still need babysitting lest they add 20k lines of code that could have been 2k.
There are two opposite ways to do this.

Codex is like an external consultant. You give it specs and it quietly putters away and only stops when the feature is done.

Claude is built more like a pair programmer: it displays changes live and "talks" about what it's doing, what's working, etc.

It's really, REALLY hard to abort Codex mid-run to correct it. With Claude it's a lot easier when you see it doing something stupid or going off the rails. Just hit ESC and tell it where it went wrong (like "use task build, don't build it manually" or "use markdownlint, don't spend 5 minutes editing the markdown line by line").

I also use AI to do discrete, well-defined tasks so I can keep an eye on things before they go astray.

But I thought there are lots of agentic systems that loop back and ask for approval every few steps, or after every agent does its piece. Is that not the case?
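As a rough sketch of what such an approval loop amounts to, assuming the harness lets you plug in your own step runner and approval check: `run_step` and `approve` here are hypothetical callables, not the API of Claude Code, Cursor, Codex, or any other tool.

```python
from typing import Callable, List, Sequence


def run_with_approval(
    steps: Sequence[str],
    run_step: Callable[[str], str],
    approve: Callable[[str, str], bool],
    every: int = 1,
) -> List[str]:
    """Run steps in order, pausing for approval after every `every` steps.

    Stops at the first rejected checkpoint and returns the results
    produced so far, so a human can correct course before continuing.
    """
    completed: List[str] = []
    for index, step in enumerate(steps, start=1):
        result = run_step(step)
        completed.append(result)
        # Checkpoint: hand the latest step and its result to the reviewer.
        if index % every == 0 and not approve(step, result):
            break
    return completed
```

In practice `approve` could be as simple as an `input()` prompt. Setting `every=1` gives the review-each-step workflow described upthread, and larger values trade oversight for speed.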