by bob1029 4 days ago
I've been able to avoid context size issues by applying one simple constraint to my agent loop. What I do is prevent all tool calling in the user's top-level conversation thread. Anything that needs to tool call must happen in a recursive invoke of the agent, which returns whatever results to caller.

I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.

There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.

Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.


For anyone using Claude Code, ask it to do all the work in workflows (it has a tool for that), they released that feature together with Opus 4.8 and it also seems a bit better at doing long tasks as well. The main conversation just orchestrates the work at that point.
You can also just ask it to do work in a subagent. It will write a plan and launch the subagent to do the actual code, keeping it out of the main context.

In addition, you can co-author a plan for a biggish chunk of work, divided into stages, have it launch a subagent for phase 1 and check its work, then ESC-ESC to go back to just after you wrote the plan and have it do phase 2. Repeat until done. This keeps the overall goal in the main context for the review, but clears out previous reviews. Kind of like a workflow but with more control.

My problem with regular sub-agents is that after around 2-4 hours the main agent stops working on a task and asks for user input, no matter how I tell it to continue autonomously until the ~5-15 stage plan is done, even when it has a clear plan made in plan mode and instructions to continue autonomously.

It's happened multiple times: I give it a task before going to sleep, and when I come back it's stuck halfway through on some stupid summary where the only response needed from me is basically "Yeah, continue.", even though I use Opus. Using workflows for the higher-level planning helped with that, and those annoying pauses no longer happen, perhaps because the main conversation is much shorter and apparently not long enough for the weights to nudge towards asking for user confirmation.

Well, 2-4 hours of autonomous work is outside my comfort zone to start with. But have you tried Claude Code’s “auto mode”? I haven’t seen this premature stopping since it was introduced.
This makes intuitive sense. Can I ask what harness you're using that allows you to configure the constraint and how?
You can do this in opencode and pi (haven't used), by defining your own agents or overriding the built-in ones, so in your primary agent you can disable all tools and give it good instructions for how to delegate

I imagine most harnesses should have a way to do this today; if they don't, get a new one. OpenCode, for example, is highly customizable, and Claude and VS Code both support a ton as well, including custom agents (though it's unclear whether you can create a custom top-level agent in claude-code).

https://opencode.ai/docs/agents/

https://code.claude.com/docs/en/sub-agents

https://code.visualstudio.com/docs/agent-customization/custo...

Thanks, those don't deterministically prevent the main loop from using tools though. Unless I'm wrong, that's just prompting the main agent on when to use specialized sub-agents.
you can configure tools, thinking, permissions et al on a per agent basis in the frontmatter, or via config (which they use in the examples), either location is valid, merging order (?)

the main agent would be very different, basically an orchestrator, and you are "loop engineering" it, and turning off all the things for this main agent besides being able to run subagents

for opencode:

https://opencode.ai/docs/agents/#permissions (what tools, mcp, etc...)

https://opencode.ai/docs/agents/#task-permissions (what subagents it can call)

https://opencode.ai/docs/agents/#additional (thinking effort)

It's a custom agent loop. There are no other parties involved here. Just vanilla C#/.NET and the OpenAI DLL.
I would also be really interested in seeing this if you’re willing to share it.
Are you going to open source it?
Claude Code seems to automatically do this in some cases. It seems to have some heuristic "will eat a lot of context" where it decides to dispatch a sub agent.

I see it pretty frequently in troubleshooting and data analysis flows where it will dump the data collection and aggregation into a sub agent then pull out a summarized result.

I'll do something similar where I have the main agent maintain context in a design doc/markdown file and update as it goes along. Then I can clear/restart/handoff at will

Might depend on the model. Haiku doesn’t like to delegate unless you ask it to. I have a custom command for “delegate plan, delegate code, delegate review”, but launching it with Haiku gives me mediocre results.
I have a different way, but I'm still trying to figure out how well it works. Instead of going into recursion, the agent is allowed to restart the thread by doing a summarize/debrief/reflect pass, writing key findings into persistent memory, and rewriting the prompt whenever the context grows too large or it gets stuck. Recursion with TCO, if you will.

In a way it's a generalization of the spec-driven approach, but in addition to the formal spec, the carryover buffer lives in the memory.
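
A rough sketch of that restart loop, under heavy assumptions: `run_with_restarts`, `step`, `summarize`, and the character budget are all names and numbers made up for illustration. A real loop would count tokens and call a model instead of these stubs.

```python
def run_with_restarts(task, step, summarize, max_chars=200, max_steps=50):
    """Run `step` repeatedly, restarting the thread whenever it grows too large."""
    memory = []        # persistent findings; survives every restart
    context = [task]   # the live thread; thrown away on restart
    restarts = 0
    for _ in range(max_steps):
        text, done = step(context)
        context.append(text)
        if done:
            return text, memory, restarts
        if sum(len(m) for m in context) > max_chars:
            # Debrief pass: keep only what the summarizer decides is important.
            memory.append(summarize(context))
            # Rewritten prompt: original task plus the carryover buffer.
            context = [task + "\nKnown so far: " + "; ".join(memory)]
            restarts += 1
    raise RuntimeError("step budget exhausted")


# Hypothetical stand-ins for the model turn and the summarize pass.
calls = {"n": 0}

def fake_step(context):
    calls["n"] += 1
    return f"note {calls['n']} " + "." * 100, calls["n"] == 5

def fake_summarize(context):
    return f"{len(context) - 1} notes reviewed"

final, memory, restarts = run_with_restarts("Map the repo", fake_step, fake_summarize)
print(restarts, memory)
```

The loop never nests, which is the "tail call" flavor: each restart replaces the thread instead of stacking a new one on top.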

Kiro does this automatically from what I can tell using it
This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors. I am not an expert but it sounds like your "one simple trick" would fix context issues and allow much tighter control over token usage. Thanks for being willing to share this tip in an HN comment, changing how those in the know use AI agents going forward -- it's hard to keep up!
The tokens are still being burnt; they're just being burnt in a parallel dimension from the user's main context window.
It's true that the initial tool response still has the same number of tokens, but it doesn't keep getting dragged along in the longer-lived top context.
Don't you resend the context after every turn? Splitting it avoids the n^2 token usage (granted, it's cached, so there's some optimal amount here).
Yes, exactly. You resend it on every turn (assuming no cache hits). This is why using the shorter-lived subagent to take in that context and return only the useful result to the longer-lived context saves tokens.
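
To put rough numbers on that, here is a back-of-the-envelope sketch. All sizes are invented for illustration (20 turns of 500 tokens, one 20,000-token tool result, a 300-token summary), and it assumes no cache hits at all.

```python
def tokens_sent(turn_sizes):
    """Total input tokens when the whole history is resent on every turn."""
    total = 0
    context = 0
    for size in turn_sizes:
        context += size   # the new message joins the history
        total += context  # the full history is sent again
    return total


# Flat: the 20,000-token tool result lands in turn 1 and rides along for all 20 turns.
flat = [500] * 20
flat[0] += 20_000

# Delegated: a 3-turn sub-agent absorbs the tool result; root only gets a 300-token summary.
sub = [20_000 + 500, 500, 500]
root = [500] * 20
root[0] += 300

flat_total = tokens_sent(flat)
delegated_total = tokens_sent(sub) + tokens_sent(root)
print(flat_total)       # 505000
print(delegated_total)  # 174000
```

With caching the gap shrinks, but the big result still occupies the long-lived window in the flat case.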
The real benefit is being able to use a cheaper, but good enough, model with a specific system prompt dedicated to that task.
> This is interesting to me because reducing context & token usage is in the user's best interest but not in the financial interest of AI vendors.

AI vendors still need to compete with each other both in terms of token cost and competency. An agent that is costly and less effective by wasting tokens is less competitive.

> There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia

Makes me wonder if it would be best to have some sort of "fork" operation to start the new agent. Rather than starting from blank it inherits the existing context (which is already cached for evaluation) plus a bit on top for its specific task. Much like the system call there would essentially be two returns, the one in the agent says "You are the agent, perform the discussed work" and the parent gets the result produced by the agent.

I've actually tried something like that, with two tools:

* push("What I'm about to do"),

* pop("What I've achieved").

"Push" marks the position in the current context after the call and returns "Proceed"; "pop" erases everything after the matching call and replaces "Proceed" in its result with the argument passed to "pop", effectively pruning the long-winded head even inside one reasoning stream. In the end the model only sees that it decided to work on something and that the work is already done, forgetting everything in between except what it itself decided was important.

Gemma 4 31B QAT successfully uses it when navigating a maze, marking positions at intersections, exploring them, navigating back and pushing again if necessary. Smaller models often fail to mark positions and forget to backtrack as well, instead they try to rely on themselves to track their paths and navigate back (and also fail).

I think it should work for long-running deep research tasks, but I was too lazy to test it, because it all required a lot of code to glue this up, since most tools and libs are not designed to work like that, and now I'll need even more code to test it, without a purposeful task.
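
For the curious, the push/pop pruning described above might look roughly like this. It's a minimal sketch, not the actual glue code: `PrunableContext` and the (role, text) message shape are assumptions, and wiring it into a real inference library is the hard part that's left out.

```python
class PrunableContext:
    """A message list where pop() collapses everything since the matching push()."""

    def __init__(self):
        self.messages = []  # list of (role, text) tuples
        self._marks = []    # stack of indices pointing at each "Proceed" result

    def add(self, role, text):
        self.messages.append((role, text))

    def push(self, intent):
        self.add("tool_call", f"push({intent!r})")
        self.add("tool_result", "Proceed")
        self._marks.append(len(self.messages) - 1)

    def pop(self, achieved):
        if not self._marks:
            raise RuntimeError("pop() without a matching push()")
        mark = self._marks.pop()
        del self.messages[mark + 1:]                    # erase the exploration
        self.messages[mark] = ("tool_result", achieved)  # "Proceed" becomes the outcome


ctx = PrunableContext()
ctx.add("user", "Find the exit")
ctx.push("Explore the left corridor")
ctx.add("tool_call", "move('left')")
ctx.add("tool_result", "dead end")
ctx.pop("Left corridor is a dead end")
print(ctx.messages)
```

After the pop, the context holds the user prompt, the push call, and the outcome; the two exploration messages are gone.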

How do you get the agent to stick to it without constantly rejecting tool calls with the same description? I've tried a similar setup a number of times and it tends to forget about this constraint very quickly.
The tool itself enforces the constraint. This is deterministic. If an agent tries to read a big fat file in root, it gets an error from that tool's implementation that reiterates the requirement.

I don't bother warning it in the system prompt anymore. It's pointless. I let it bump its head as required. A few hundred tokens and the agent is back on track each time.
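
A minimal sketch of what "the tool itself enforces the constraint" could look like. The real loop is C#/.NET and isn't shown anywhere in this thread, so the `subagent_only` decorator and the `depth` keyword here are assumptions for illustration only.

```python
import functools


def subagent_only(tool):
    """Wrap a tool so it refuses to run in the root thread (depth 0)."""
    @functools.wraps(tool)
    def guarded(*args, depth, **kwargs):
        if depth == 0:
            # The error text is what the model reads, so it restates the rule.
            return (f"ERROR: {tool.__name__}() cannot run in the root thread. "
                    "Delegate with call(prompt) instead.")
        return tool(*args, **kwargs)
    return guarded


@subagent_only
def read_file(path):
    with open(path, encoding="utf-8") as f:
        return f.read()


# The root attempt is rejected before the file system is touched.
print(read_file("big_fat_file.txt", depth=0))
```

Because the check lives in code rather than in the prompt, the model can forget the rule as often as it likes; it just pays a short error message each time.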

If the model isn't following the system/developer prompts easily, you might want to try a bigger/better model, tends to mostly be about model quality if it doesn't follow what you tell it to. Besides that, conflicting directions in the system/developer prompts can lead to the model seemingly ignoring instructions too.
So what does the top level thread look like? "Make foo() do bar" (Subagent invoked) "Job finished!"
The top level and N+1 look like:

  [User] Actual human prompt
  [Agent] Attempted use of tool & hand slap
  [Agent] call(projection of user's prompt relative to discovered tool constraints)
    ["User"] Prompt from above call
    [Agent] Legal tool use
    [Agent] ... until satisfied 
    [Agent] return(summary that satisfies the prompt for this level of execution)
  [Agent] Additional call() invokes possible depending on returned summary
  [Agent] Final return(summary) from root ends this turn of conversation and user sees summary
  [User] Next turn of conversation initiated by actual human
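
The same shape as a toy loop, in case it helps. This is a sketch with a scripted stand-in for the model: `run_agent`, `scripted_model`, and the action dict format are invented here, not taken from the actual implementation.

```python
def run_agent(prompt, model, tools, depth=0):
    """Run one level of the agent; each level gets its own fresh context."""
    messages = [("user", prompt)]
    while True:
        action = model(messages, depth)  # {"tool": name, "arg": value}
        name, arg = action["tool"], action["arg"]
        if name == "return":
            return arg  # only the summary travels up to the caller
        if name == "call":
            result = run_agent(arg, model, tools, depth + 1)
        elif depth == 0:
            result = f"ERROR: {name}() is not allowed in the root thread; use call()."
        else:
            result = tools[name](arg)
        messages.append(("agent", f"{name}({arg!r})"))
        messages.append(("tool", result))


def scripted_model(messages, depth):
    """Hypothetical stand-in for an LLM: picks the next action by transcript length."""
    last = messages[-1][1]
    if depth == 0:
        steps = [
            {"tool": "read_file", "arg": "big.log"},        # gets the hand slap
            {"tool": "call", "arg": "Summarize big.log"},   # delegates instead
            {"tool": "return", "arg": "Done: " + last},
        ]
    else:
        steps = [
            {"tool": "read_file", "arg": "big.log"},        # legal at depth 1
            {"tool": "return", "arg": f"big.log has {len(last)} characters"},
        ]
    return steps[(len(messages) - 1) // 2]


tools = {"read_file": lambda path: "x" * 50_000}  # fake 50,000-character file
print(run_agent("What is in big.log?", scripted_model, tools))
# Done: big.log has 50000 characters
```

The 50,000 characters only ever exist in the depth-1 message list; root sees the error, the call, and the one-line summary.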
How do you get something like this set up?
Which tools? Even file reads and writes?
Especially these things.

The only tools permissible to root in my scheme are call() and return().

Is it in pi.dev? Don't thinking tokens still take up context?