| I guess the interesting part is the forces tool writing part. Which kind of solves the when should we write a tool part by just saying always. But I think the question is how will this scale. The real core issue I feel like I’ve been encountering is scaling complexity. Reducing the number of tools without losing efficiency or capability. Reducing duplication, abstracting, cleaning up, and maintaining knowledge and memory. I think the issue for me has been threefold. 1. As the repo grows how does you make the agent keep understanding of it without excessive context pollution. 2. How do you maintain memory and knowledge over time. 3. How do you know the agent is performing better over time and not regressing as you evolve. And what has somewhat been working for me is A) trees or hierarchies. Trees scale well. Folder structure but also in the form of just simple indices. Logical structure and locality makes them even more effective. B) caching. Having the agents “cache” their thinking in the form of summaries, skills, tools. Recursive summarization really helped with mono repo navigation for me. But right now I still feel like I need to be constantly prompting them and I can’t quite close the feedback loop. |
(2) is where the structured capability format earns its keep over free-text memory. Triggers and suppression conditions give you inspectable, versioned invocation policy rather than prose that degrades over time. Still early though.
(3) I don't have a good answer to yet. Your point about feedback loops is the right framing — knowing whether the agent is actually getting better rather than just accumulating more tools is unsolved. The audit angle (administrators reasoning about which tools fire, when, and whether they should) is where I think this needs to go, but I haven't built that layer.
One thing that might directly address your caching point though — ADRs (Architecture Decision Records). The article that spawned Tendril started with giving an agent a record_decision capability that wrote ADRs to the filesystem. ADRs as agent cache is an interesting framing: structured, persistent, searchable records of why decisions were made at the moment they were made. That's arguably a better cache primitive than summarisation — decisions don't degrade the way summaries do, and they give you something to reason about for regression detection too.
Your tree/hierarchy observation resonates — the registry is a flat index right now which probably doesn't scale past a few dozen capabilities without some grouping structure.