by mattswulinski
110 days ago
The "treat your context window like RAM" framing resonates. We've been running into this exact tension building agentic workflows with Claude Code: the more tools you make available, the worse selection accuracy gets, even with very capable models.
Curious what others have found: does the code-generation approach to tool calling (agent writes Python instead of picking from JSON schemas) actually hold up at scale? It seems elegant for composition, but I'd worry about hallucinated function names or incorrect arguments being harder to catch than a malformed structured call. With JSON schemas you at least get validation for free.
Also interested in the "use intelligence once to create automation that runs forever without intelligence" pattern for cron jobs. Has anyone found a good middle ground between fully scripted automations and full LLM-every-loop? The cost blowup they describe ($5k/month from a 5-minute cron) seems like it would kill most production deployments before they prove value.
|
In practice the self-correction rate is high. The agent writes a script, gets a traceback or validation error, reads it, and fixes the issue — usually within one retry. The skill files help a lot here because they contain the exact function signatures and known gotchas, so the model isn't guessing from memory. It's closer to a developer with good docs open than a model hallucinating API calls.
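To make the loop above concrete, here's a rough sketch of "run it, read the traceback, fix it, retry." This is not our actual harness: `run_script`, `run_with_retries`, and the `revise` callback (standing in for the model call that gets the failing source plus the error text) are all hypothetical names for illustration.

```python
import os
import subprocess
import sys
import tempfile
from typing import Callable, Optional, Tuple


def run_script(source: str, timeout: float = 30.0) -> Optional[str]:
    """Run `source` in a fresh interpreter. Return stderr on failure, None on success."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(source)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    finally:
        os.unlink(path)
    return None if proc.returncode == 0 else proc.stderr


def run_with_retries(
    source: str,
    revise: Callable[[str, str], str],  # hypothetical: (source, error text) -> new source
    max_retries: int = 2,
) -> Tuple[str, int]:
    """Return (working source, number of revisions used), or raise if retries run out."""
    for attempt in range(max_retries + 1):
        error = run_script(source)
        if error is None:
            return source, attempt
        if attempt < max_retries:
            # The traceback goes back to the model verbatim; it is the feedback signal.
            source = revise(source, error)
    raise RuntimeError(f"script still failing after {max_retries} retries")
```

The point of the sketch is that a hallucinated function name shows up as an ordinary `NameError` or `AttributeError` in stderr, which is the same kind of signal a schema validation failure would give you, just one step later.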
On the cron middle ground: the three-tier system is exactly that, and the conditional tier is where most automations end up. A typical example: "alert me when a competitor publishes a new blog post." The agent writes a Python script that checks the RSS feed every 30 minutes. If there's a new post, it spins up an LLM to summarize it and decide if it's worth alerting about. The check costs fractions of a cent. The LLM only runs when there's actually something to reason about.
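A minimal sketch of what that conditional tier could look like, assuming a plain RSS 2.0 feed. The names `new_items`, `run_check`, and the `assess` callback (the stand-in for the LLM call) are made up for illustration, and fetching the feed, scheduling, and persisting the seen set are left out.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, List, Optional, Set


def new_items(feed_xml: str, seen: Set[str]) -> List[Dict[str, str]]:
    """Cheap scripted check: return feed items whose id is not in `seen`."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for item in root.iter("item"):
        # Fall back to <link> when the feed has no <guid>.
        item_id = item.findtext("guid") or item.findtext("link") or ""
        if item_id and item_id not in seen:
            fresh.append({"id": item_id, "title": item.findtext("title") or ""})
    return fresh


def run_check(
    feed_xml: str,
    seen: Set[str],
    assess: Callable[[Dict[str, str]], Optional[str]],  # hypothetical LLM call
) -> List[str]:
    """One cron tick: `assess` is invoked only for posts that were not seen before."""
    alerts = []
    for post in new_items(feed_xml, seen):
        seen.add(post["id"])
        verdict = assess(post)  # the only step that costs model tokens
        if verdict is not None:
            alerts.append(verdict)
    return alerts
```

On a tick with no new posts, `run_check` returns an empty list without ever calling `assess`, which is the whole cost argument: the expensive step is gated behind a check that is just XML parsing and a set lookup.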
The key insight we had is that the agent itself is often the best judge of which tier a cron should be. When a user describes what they want, the agent decides whether it needs reasoning every run or just a script with a conditional trigger. And if you ask it to audit its own crons, it'll often downgrade full-agent crons to conditional or scripted ones on its own. Turns out "look at this thing you're doing every hour and figure out if you actually need to think each time" is a prompt that works surprisingly well.