
We validate tool calls inside the container, using Pydantic models built directly from the JSON schemas. So the agent gets instant feedback if it passes the wrong type, misses a required field, or hallucinates a parameter, all before anything hits the external API. You get the composability of code generation with the validation guarantees of structured calls.
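To make that concrete, here is a rough sketch of the idea, not our actual code. It assumes Pydantic v2, and the `SEND_EMAIL_SCHEMA` tool schema plus the helper names `model_from_schema` and `check_tool_call` are made up for illustration. It only handles flat schemas with primitive types.

```python
from typing import Any, Optional

from pydantic import ConfigDict, ValidationError, create_model

# Hypothetical tool schema, invented for this example.
SEND_EMAIL_SCHEMA = {
    "type": "object",
    "properties": {
        "to": {"type": "string"},
        "subject": {"type": "string"},
        "priority": {"type": "integer"},
    },
    "required": ["to", "subject"],
}

# Minimal JSON Schema type -> Python type mapping (flat schemas only).
JSON_TO_PY = {
    "string": str,
    "integer": int,
    "number": float,
    "boolean": bool,
    "array": list,
    "object": dict,
}


def model_from_schema(name: str, schema: dict):
    """Build a Pydantic model class from a flat JSON schema."""
    required = set(schema.get("required", []))
    fields = {}
    for prop, spec in schema.get("properties", {}).items():
        py_type = JSON_TO_PY.get(spec.get("type"), Any)
        if prop in required:
            fields[prop] = (py_type, ...)  # "..." marks the field as required
        else:
            fields[prop] = (Optional[py_type], None)
    # extra="forbid" rejects parameters that are not in the schema,
    # which is how a hallucinated argument gets caught.
    return create_model(name, __config__=ConfigDict(extra="forbid"), **fields)


def check_tool_call(model, args: dict) -> list:
    """Return readable error strings; an empty list means the call is valid."""
    try:
        model.model_validate(args)
        return []
    except ValidationError as exc:
        return [
            ".".join(str(part) for part in err["loc"]) + ": " + err["msg"]
            for err in exc.errors()
        ]


SendEmail = model_from_schema("SendEmail", SEND_EMAIL_SCHEMA)
# Missing "subject", wrong type for "priority", and an unknown "cc" parameter.
for line in check_tool_call(
    SendEmail, {"to": "a@example.com", "priority": "high", "cc": "b@example.com"}
):
    print(line)
```

The error strings are what the agent would see in place of a failed API call, so it can fix the script before anything leaves the container.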

In practice the self-correction rate is high. The agent writes a script, gets a traceback or validation error, reads it, and fixes the issue — usually within one retry. The skill files help a lot here because they contain the exact function signatures and known gotchas, so the model isn't guessing from memory. It's closer to a developer with good docs open than a model hallucinating API calls.

On the cron middle ground: the three-tier system is exactly that, and the conditional tier is where most automations end up. A typical example: "alert me when a competitor publishes a new blog post." The agent writes a Python script that checks the RSS feed every 30 minutes. If there's a new post, it spins up an LLM to summarize it and decide if it's worth alerting about. The check costs fractions of a cent. The LLM only runs when there's actually something to reason about.
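A conditional-tier script for that example might look roughly like the sketch below. This is an illustration under assumptions, not the script the agent actually writes: `new_posts`, `run_check`, and the `summarize` callback are hypothetical names, the feed is assumed to be plain RSS 2.0, and fetching the feed and scheduling the 30-minute interval are left to whatever runs the cron.

```python
import xml.etree.ElementTree as ET
from typing import Callable


def new_posts(feed_xml: str, seen: set) -> list:
    """Cheap check: return feed items whose id has not been seen before."""
    root = ET.fromstring(feed_xml)
    posts = []
    for item in root.iter("item"):
        link = (item.findtext("link") or "").strip()
        # Fall back to the link when the item has no <guid>.
        post_id = (item.findtext("guid") or link).strip()
        if post_id and post_id not in seen:
            title = (item.findtext("title") or "").strip()
            posts.append({"id": post_id, "title": title, "link": link})
    return posts


def run_check(feed_xml: str, seen: set, summarize: Callable) -> list:
    """One cron run: the LLM callback fires only when there is a new post."""
    posts = new_posts(feed_xml, seen)
    # With no new posts this list is empty and the expensive step never runs.
    results = [summarize(post) for post in posts]
    seen.update(post["id"] for post in posts)
    return results
```

In a real run, `feed_xml` would come from something like `urllib.request.urlopen(feed_url).read().decode()`, `seen` would be persisted between runs, and `summarize` would wrap the LLM call that decides whether the post is worth an alert.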

The key insight we had is that the agent itself is often the best judge of which tier a cron should be. When a user describes what they want, the agent decides whether it needs reasoning every run or just a script with a conditional trigger. And if you ask it to audit its own crons, it'll often downgrade full-agent crons to conditional or scripted ones on its own. Turns out "look at this thing you're doing every hour and figure out if you actually need to think each time" is a prompt that works surprisingly well.