Hacker News
by MickeyShmueli 106 days ago
the mock tool platform thing is smart. testing agents against real APIs is a nightmare, you get flakiness, you burn through rate limits, and you can't reproduce failures

one thing i'm curious about: how do you handle testing the tool selection logic itself? like the agent choosing WHICH tool to call is often where things break, not the tool execution

we had a support agent that would sometimes call the "refund order" tool when the user just wanted to check order status. the tool worked perfectly, the LLM just kept picking the wrong one. your mock platform lets you verify the tool returns the right data, but does it catch when the agent calls the wrong tool entirely?
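a rough sketch of the kind of check i mean, not how their platform actually works: wrap each mock tool so it records its own invocation, then assert on which tool got called rather than on what it returned. everything here is made up for illustration (`ToolCallRecorder`, `selection_errors`, the tool names, and `buggy_agent`, which is a keyword router standing in for the real LLM-backed agent)

```python
from dataclasses import dataclass, field


@dataclass
class ToolCallRecorder:
    """Hypothetical mock registry that records which tools were invoked, in order."""
    calls: list = field(default_factory=list)

    def make_tool(self, name, result):
        # Build a mock tool that logs its own invocation and returns canned data.
        def tool(**kwargs):
            self.calls.append((name, kwargs))
            return result
        return tool


def selection_errors(agent, cases):
    """Run each (message, expected_tool) case and return the mismatches."""
    errors = []
    for message, expected in cases:
        recorder = ToolCallRecorder()
        tools = {
            "check_order_status": recorder.make_tool(
                "check_order_status", {"status": "shipped"}
            ),
            "refund_order": recorder.make_tool("refund_order", {"refunded": True}),
        }
        agent(message, tools)
        called = [name for name, _ in recorder.calls]
        # The assertion is about WHICH tool ran, not about what it returned.
        if called != [expected]:
            errors.append((message, expected, called))
    return errors


def buggy_agent(message, tools):
    """Stand-in for the real agent: routes every order question to refunds."""
    if "order" in message.lower():
        return tools["refund_order"](order_id="A1")
    return None


cases = [
    ("where is my order?", "check_order_status"),
    ("i want a refund for my order", "refund_order"),
]
errors = selection_errors(buggy_agent, cases)
for message, expected, called in errors:
    print(f"{message!r}: expected {expected}, agent called {called}")
```

with a real model you'd probably run each case several times and track the mismatch rate, since a single passing run doesn't prove much.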

also the full-session evaluation vs turn-by-turn is spot on. had a similar issue with a verification flow where each individual turn looked fine in langsmith but the overall flow was completely broken. you'd see "assistant asked for name" (good), "assistant asked for phone" (good), "assistant processed request" (good), but it never actually verified the phone number matched the account
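to make the session-level point concrete, here's a minimal sketch assuming the trace can be flattened into a list of events. the event names (`ask_name`, `verify_phone`, `process_request`) and the `matched` field are hypothetical, not langsmith's schema. the check is an ordering invariant over the whole session, which is exactly what a per-turn grader can't see

```python
def session_violations(trace):
    """Return session-level problems that per-turn checks would not see."""
    violations = []
    verified = False
    for i, event in enumerate(trace):
        if event["type"] == "verify_phone" and event.get("matched") is True:
            verified = True
        elif event["type"] == "process_request" and not verified:
            violations.append(
                f"event {i}: request processed before phone was verified"
            )
    return violations


# Every turn looks fine on its own, but the phone is never checked
# against the account before the request is processed.
broken_trace = [
    {"type": "ask_name"},
    {"type": "ask_phone"},
    {"type": "process_request"},
]

fixed_trace = [
    {"type": "ask_name"},
    {"type": "ask_phone"},
    {"type": "verify_phone", "matched": True},
    {"type": "process_request"},
]

print(session_violations(broken_trace))
print(session_violations(fixed_trace))
```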

tbh this feels like one of those problems that's obvious in hindsight but nobody builds the tooling for until they get burned in production

1 comment

In that case I think you can have a refund subagent that is responsible for checking whether the user really asked for a refund before doing anything dangerous. But it only reduces errors, it doesn't eliminate them: LLMs are non-deterministic by nature.
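A minimal sketch of that gate, under loud assumptions: `guarded`, `IntentNotConfirmed`, `refund_order`, and `asked_for_refund` are all hypothetical names, and the keyword check is only a placeholder for the subagent, which in practice would be another model call and could itself be wrong.

```python
class IntentNotConfirmed(Exception):
    """Raised when the gate refuses to run a dangerous tool."""


def guarded(tool, confirm_intent):
    """Wrap a dangerous tool so it only runs when confirm_intent approves."""
    def wrapper(user_message, **kwargs):
        if not confirm_intent(user_message):
            raise IntentNotConfirmed(
                f"blocked {tool.__name__} for message: {user_message!r}"
            )
        return tool(**kwargs)
    return wrapper


def refund_order(order_id):
    """Hypothetical dangerous tool; a real one would hit the payments backend."""
    return {"refunded": order_id}


def asked_for_refund(user_message):
    """Placeholder for the refund subagent (assumption: a simple keyword check)."""
    return "refund" in user_message.lower()


safe_refund = guarded(refund_order, asked_for_refund)

print(safe_refund("please refund order 7", order_id="7"))
try:
    safe_refund("where is my order?", order_id="7")
except IntentNotConfirmed as exc:
    print(exc)
```

The gate only narrows the failure window: if the confirming subagent misreads the message, the refund still goes through.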