by MickeyShmueli
106 days ago
the mock tool platform thing is smart. testing agents against real APIs is a nightmare: you get flakiness, you burn through rate limits, and you can't reproduce failures.

one thing i'm curious about: how do you handle testing the tool selection logic itself? the agent choosing WHICH tool to call is often where things break, not the tool execution.

we had a support agent that would sometimes call the "refund order" tool when the user just wanted to check order status. the tool worked perfectly, the LLM just kept picking the wrong one. your mock platform lets you verify the tool returns the right data, but does it catch when the agent calls the wrong tool entirely?

also the full-session evaluation vs turn-by-turn point is spot on. had a similar issue with a verification flow where each individual turn looked fine in langsmith but the overall flow was completely broken. you'd see "assistant asked for name" (good), "assistant asked for phone" (good), "assistant processed request" (good), but it never actually verified the phone number matched the account.

tbh this feels like one of those problems that's obvious in hindsight but nobody builds the tooling for until they get burned in production.
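to make both failure modes concrete, here's a rough sketch of session-level checks. it assumes the mock platform can hand back the ordered list of tool calls for a whole session; `ToolCall`, `unexpected_tools`, `called_without`, and every tool name below are made up for illustration, not anyone's real API.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One recorded tool invocation (hypothetical trace format)."""
    name: str
    args: dict = field(default_factory=dict)


def unexpected_tools(trace, allowed):
    """Return names of tools the agent called that the scenario did not allow."""
    return [call.name for call in trace if call.name not in allowed]


def called_without(trace, action, prerequisite):
    """True if `action` was called before any `prerequisite` call in the session."""
    seen_prerequisite = False
    for call in trace:
        if call.name == prerequisite:
            seen_prerequisite = True
        elif call.name == action and not seen_prerequisite:
            return True
    return False


# Scenario 1: user only asked for order status, agent also issued a refund.
status_trace = [
    ToolCall("get_order_status", {"order_id": "A1"}),
    ToolCall("refund_order", {"order_id": "A1"}),
]
wrong_tools = unexpected_tools(status_trace, allowed={"get_order_status"})

# Scenario 2: every turn looks fine, but the phone was never verified.
verify_trace = [
    ToolCall("lookup_account", {"name": "Sam"}),
    ToolCall("process_request", {"account_id": "42"}),
]
skipped_verification = called_without(
    verify_trace, action="process_request", prerequisite="verify_phone"
)
```

the first check is a per-scenario allowlist, which catches the refund-instead-of-status case. the second is an ordering invariant over the whole session, which is the kind of thing turn-by-turn scoring can't see.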