| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by zachdotai 133 days ago
	Tool invocation. Each time the agent emits a tool call, the evaluator assesses it against the original task intent plus a rolling window of recent tool results. We tried coarser units (plan nodes, full steps) but drift compounds fast, by the time a step finishes, the agent may have already chained 3-4 bad calls. Tool-level gives the tightest correction loop. The cost is ~200ms latency per invocation. For hot paths we sample (every 3rd call, or only on tool-category changes) rather than evaluate exhaustively.

1 comments

nordic_lion 132 days ago

That makes sense binding to the smallest viable control surface, and the sampling strategy for hot paths sounds like a pragmatic balance between latency and coverage. Thanks for the additional feedback here.

link