by adastra22 314 days ago

...

I also have a task.md workflow that I'm actively iterating on. It's the one I can leave working autonomously for half an hour to an hour, and I'm often surprised to find very good results (but sometimes very terrible results) at the end of it. I'm not going to release this one because, frankly, I'm starting to realize there might be a product around this and I may move on that (although this is already a crowded space). But I don't mind outlining in broad strokes how it works (hand-summarized, very briefly):

""" You are a senior software engineer in a leadership role, directing junior engineers and research specialists (your subagents) to perform the task specified by the user.

1. If PLAN.md exists, read its contents and skip to step 4.

2. Without making any tool calls, consider the task as given and extrapolate the underlying intent of the user. [A bunch of rules and conditions related to this first part -- clarify the intent of the user without polluting the context window too much]

3. Call the software-architect agent with the reformulated user prompt, and with clear instructions to investigate how the request would be implemented on the current code base. The agent is to fill its context window with the portions of the codebase and developer documentation in this repo relevant to its task. It should then generate and report a plan of action. [Elided steps involving iterating on that plan of action with the user, and various subagents to call out to in order to make sure the plan is appropriately sequenced in terms of dependent parts, chunked into small development steps, etc. The plan of action is saved in PLAN.md in the root of the repository.]

4. While there are unfinished todos in the PLAN.md document, repeat the following steps:

a) Call rust-engineer to implement the next todo and/or verify completion of the todo.

b) Call each of the following agents with instructions to focus on the current changes in the workspace. If any actionable items are found in the generated report that are within the scope of the requested task, call rust-engineer to address these items and then repeat:

- rust-nit-checker [checks for things I find Claude gets consistently wrong in Rust code]

- test-completeness-checker [checks for missing edge cases or functionality not tested]

- code-smell-checker [a variant of the software architect agent that reports when things are generally sus]

- [... a handful of other custom agents; I'm constantly adjusting this list]

- dirty-file-checker [reports any test files or other files accidentally left and visible to git]

c) Repeat from step a until you run through the entire list of agents without any actionable, in-scope issues identified in any of the reports & rust-engineer still reports the task as fully implemented.

d) Run git-commit-auto agent [A variation of the earlier git commit script that is non-interactive.]

e) Mark the current todo as done in PLAN.md

5. If there are any unfinished todos in PLAN.md, return to step 4. Otherwise call software-architect agent with the original task description as approved by the user, and request it to assess whether the task is complete, and if not to generate a new PLAN.md document.

6. If a new PLAN.md document is generated, return to step 4. Otherwise, report completion to the user. """
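
Step 4 of the prompt above is essentially a nested loop, and it can help to see it as ordinary control flow. The following is only a minimal sketch under stated assumptions: the post does not give the PLAN.md format, so GitHub-style "- [ ]" / "- [x]" task lines are assumed, and `Workspace`, `Checker`, `Fixer`, `next_todo`, `mark_next_done`, `review_until_clean` and `run_plan` are hypothetical names standing in for subagent calls, not anything from the author's actual setup.

```rust
/// Hypothetical workspace state; a real run would inspect the git working tree.
#[derive(Debug, Default)]
struct Workspace {
    problems: Vec<String>,
    commits: usize,
}

/// Stands in for one review subagent: returns the actionable, in-scope
/// issues it found (an empty list means a clean report).
type Checker = fn(&Workspace) -> Vec<String>;

/// Stands in for rust-engineer addressing a list of reported issues.
type Fixer = fn(&mut Workspace, &[String]);

/// First unchecked item, ASSUMING "- [ ] ..." task lines in PLAN.md.
fn next_todo(plan: &str) -> Option<&str> {
    plan.lines()
        .map(str::trim_start)
        .find_map(|line| line.strip_prefix("- [ ] "))
}

/// Step 4e: a copy of the plan with the first unchecked item marked done.
fn mark_next_done(plan: &str) -> String {
    let mut marked = false;
    plan.lines()
        .map(|line| {
            if !marked && line.trim_start().starts_with("- [ ] ") {
                marked = true;
                line.replacen("- [ ] ", "- [x] ", 1)
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}

/// Steps 4b-4c: run the checkers in order; on the first non-empty report,
/// hand the issues to `fix` and start over. Returns true once a full pass
/// is clean, false if `max_passes` runs out first.
fn review_until_clean(
    ws: &mut Workspace,
    checkers: &[Checker],
    fix: Fixer,
    max_passes: usize,
) -> bool {
    for _ in 0..max_passes {
        let mut found = None;
        for &check in checkers {
            let issues = check(ws);
            if !issues.is_empty() {
                found = Some(issues);
                break;
            }
        }
        match found {
            Some(issues) => fix(ws, &issues),
            None => return true,
        }
    }
    false
}

/// Step 4 as a whole. Returns the updated plan, or Err with the todo that
/// never came back clean.
fn run_plan(
    mut plan: String,
    ws: &mut Workspace,
    implement: fn(&mut Workspace, &str),
    checkers: &[Checker],
    fix: Fixer,
    max_passes: usize,
) -> Result<String, String> {
    while let Some(todo) = next_todo(&plan).map(String::from) {
        implement(ws, &todo); // 4a: stand-in for rust-engineer
        if !review_until_clean(ws, checkers, fix, max_passes) {
            return Err(todo); // 4b-4c never converged
        }
        ws.commits += 1; // 4d: stand-in for git-commit-auto
        plan = mark_next_done(&plan); // 4e
    }
    Ok(plan)
}
```

The `max_passes` cap is an addition of this sketch, not something the post describes; it just keeps a checker and fixer that disagree from looping forever. Because the plan text is the only state carried between todos, rerunning `run_plan` on a saved plan picks up at the first unchecked item, which is the same property step 1 of the prompt relies on.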

That's my current task workflow, albeit with a number of items and agent definitions elided. I have lots of ideas for expanding it further, but I'm basically taking an iterative and incremental approach: every time Claude fumbles the ball in an embarrassing way (which does happen!), I add or tweak a rule to avoid that outcome. There are a couple of key points:

1) Using Rust is a superpower. With guidance to the agent about what crates to use, and with very strict linting tools and code-checking subagents (e.g. no unsafe code blocks, no #[allow(...)] directives to override the linter, an entire subagent dedicated to finding and calling out string-based typing and error handling, etc.) this process produces good code that largely works and does what it was requested to do. You don't have to load the whole project in context to avoid pointer or use-after-free issues, and other things that cause vibe-coded projects to fail at a certain complexity. I don't see this working in a dynamic language, for example, even though LLMs are honestly not as good at Rust as they are in more prominent languages.

2) The key part of the task workflow is the long list of analysts to run against the changes, and the assumption, which holds up well in practice, that you can just keep iterating and fixing reported issues (with some of the elided secret sauce having to do with subagents to evaluate whether an issue is in scope and needs to be fixed or can be safely ignored, and keeping an eye out for deviations from the requested task). This eventual completeness assumption does work pretty well.

3) At some point the main agent's context window gets poisoned, or it reaches the full context window and compacts. Either way this kills any chance of simply continuing. In the first case (poisoning) it loses track of the task and ends up caught in some yak-shaving rabbit hole. Usually it's obvious when you check in that this is going on, and I just nuke it and start over. In the latter case (full context window) the auto-compaction also pretty thoroughly destroys the workflow, but it usually results in the agent asking a variation on "I see you are in the middle of ... What do you want to do next?" before taking any bad action on the repo itself. Clearing the now poisoned context window with "/reset" and then providing just "task: continue" gets it back on track. I have a todo item to automate this, but the Claude Code API doesn't make it easy.

4) You have to be very explicit about what can and cannot be done by the main agent. It is trained and fine-tuned to be an interactive, helpful assistant. You are using it to delegate autonomous tasks. That requires explicit and repeated instructions. This is made somewhat easier by the fact that subagents are not given access to the user -- they simply run and generate reports for the calling agent. So I try to pack as much as I can in the subagents and make the main agent's role very well defined and clear. It does mean that you have to manage out of band communication between agents (e.g. the PLAN.md document) to conserve context tokens.
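
To make point 1 a bit more concrete, here is a small sketch of the kind of code those rules push towards. It is an illustration only: the `config` module, `Port`, `PortError` and `parse_port` are made-up names, and the post does not show the author's actual lint configuration. `forbid` is used rather than `deny` because a forbidden lint cannot be relaxed by a nested #[allow(...)], which lines up with the "no #[allow(...)] directives" rule for that lint.

```rust
// In a real crate this would usually be the crate-level attribute
// `#![forbid(unsafe_code)]` at the top of lib.rs or main.rs; it is attached
// to a module here only to keep the example self-contained.
#[forbid(unsafe_code)]
mod config {
    use std::fmt;

    /// A validated port number, instead of passing a raw string around.
    #[derive(Debug, Clone, Copy, PartialEq, Eq)]
    pub struct Port(u16);

    impl Port {
        pub fn get(self) -> u16 {
            self.0
        }
    }

    /// Typed errors that callers can match on, instead of `Result<_, String>`.
    #[derive(Debug, PartialEq, Eq)]
    pub enum PortError {
        NotANumber(String),
        Privileged(u16),
    }

    impl fmt::Display for PortError {
        fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
            match self {
                PortError::NotANumber(raw) => write!(f, "not a port number: {:?}", raw),
                PortError::Privileged(n) => write!(f, "port {} is below 1024", n),
            }
        }
    }

    impl std::error::Error for PortError {}

    /// Parses user input into a `Port`, rejecting ports below 1024
    /// (an arbitrary rule chosen for this example).
    pub fn parse_port(input: &str) -> Result<Port, PortError> {
        let n: u16 = input
            .trim()
            .parse()
            .map_err(|_| PortError::NotANumber(input.to_string()))?;
        if n < 1024 {
            return Err(PortError::Privileged(n));
        }
        Ok(Port(n))
    }
}
```

A caller can then `match` on `PortError::Privileged(_)` versus `PortError::NotANumber(_)` and the compiler checks that every case is handled, which is the sort of local guarantee that lets an agent work on one file without the whole project in context.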

If you try this out, please let me know how it goes :)

1 comment

kami23 314 days ago

I tried this tonight as my first time using anything like Claude Code, with only a week or so of Copilot agentic mode experience behind me.

It's the right path, I'm very smitten with seeing the sub agents working together. Blew through the Pro quota really fast.

I was a skeptic and am no more. Gonna see what it takes to run something basic in a home lab, and how the performance is. Even if it is incredibly slow on a beefy home system, just checking in on it should be low enough friction to let it noodle on some hobby projects.

adastra22 314 days ago

Glad it worked for you :)

Yeah it was a "HOLY SHIT" moment for me when I first started experimenting with subagents. A step-change improvement in productivity for sure. They combine well with Claude Code's built-in todo tool, and together they really start to deliver on the promised goal of automating development. Watching it delegate to subagents and then seeing the flow of information back and forth is amazing.

One thing I forgot to mention -- I run Claude within a simple sandboxed dev container like this: https://github.com/maaku/agents/tree/main/.devcontainer This lets you safely run with '--dangerously-skip-permissions', which basically gives Claude free rein within the Docker container in which it is running. This is what lets you run without user interaction.

When you say "run something basic in a home lab" do you mean local inference? Qwen3-Coder is probably the model to use if you want to go that route. Avoid gpt-oss as they used synthetic data in their training and it is unlikely to perform well.

I'm investigating this as well, as I need local inference for some sensitive data. But honestly, the Anthropic models work so well that I justified getting myself the unlimited/max plan and I mostly use that. I suspect I overbought -- at $200/mo I have yet to ever be rate limited, even with these long-running instances. I stay within the ToS and only run 1-2 sessions at a time though.