by mlenhard
403 days ago
Agreed on the unpredictability of results. Tool call selection is still something of a black box. How do you know which variations of a prompt trigger a given tool call, or how many tools are too many before you start seeing degradation because of the context window? If you are building a client rather than a server, the issue becomes even more pronounced. I even extracted the Claude Electron source to see if I could figure out how they were doing it, but it's abstracted behind a network request. I'm guessing the system prompt handles tool call selection. PS: I released an open source evals package if you're curious. Still a WIP, but it does the basics: https://github.com/mclenhard/mcp-evals
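To make the "which prompt variations trigger a given tool" question concrete, here is a rough Python sketch of how one might measure it. This is not the API of the package linked above; `tool_hit_rate`, `keyword_selector`, and the `get_weather` tool name are all hypothetical, and the selector is a trivial stand-in for whatever client or model call actually picks the tool.

```python
from collections import Counter
from typing import Callable, Dict, List


def tool_hit_rate(
    select_tool: Callable[[str], str],
    prompts: List[str],
    expected_tool: str,
    trials: int = 5,
) -> Dict[str, float]:
    """Return, per prompt, the fraction of trials that picked expected_tool."""
    rates = {}
    for prompt in prompts:
        # Repeat each prompt because real model-backed selection is not deterministic.
        picks = Counter(select_tool(prompt) for _ in range(trials))
        rates[prompt] = picks[expected_tool] / trials
    return rates


def keyword_selector(prompt: str) -> str:
    # Hypothetical stand-in for a model call: picks a tool by keyword only.
    return "get_weather" if "weather" in prompt.lower() else "none"


variants = ["What's the weather in Paris?", "Do I need an umbrella today?"]
rates = tool_hit_rate(keyword_selector, variants, expected_tool="get_weather")
for prompt, rate in rates.items():
    print(f"{rate:.0%}  {prompt}")
```

With a real selector swapped in, the per-prompt rate would show which phrasings reliably trigger the tool and which ones silently miss it.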
I'm working on a coding agent, and MCP has been a frequently requested feature, but this issue has been my main hesitation.
Getting even basic prompts that are designed to do one or two things to work reliably requires so much testing and iteration that I'm pretty skeptical that "here are 10 community-contributed MCPs, choose the right one for the task" has much hope of working reliably. Of course, the benefits if it did work are very clear, so I'm keeping a close watch on it. Evals seem like a key piece of the puzzle, though you might still end up in combinatorial explosion territory if you try to test all the potential interactions between multiple MCPs. I could also see it getting very expensive to test this way.
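For a rough sense of the combinatorial explosion, here is a back-of-the-envelope count in Python. It assumes every non-empty combination of installed MCP servers gets tested with the same prompt set; the function name `eval_runs` and the figures of 20 prompts and 5 trials are made up for illustration, and only the 10 servers comes from the example above.

```python
from math import comb
from typing import Optional


def eval_runs(
    num_servers: int,
    prompts_per_combo: int,
    trials: int = 1,
    max_combo_size: Optional[int] = None,
) -> int:
    """Count model calls needed to test server combinations up to a given size."""
    k_max = num_servers if max_combo_size is None else min(max_combo_size, num_servers)
    # Number of non-empty server subsets with at most k_max members.
    combos = sum(comb(num_servers, k) for k in range(1, k_max + 1))
    return combos * prompts_per_combo * trials


# All non-empty subsets of 10 servers: 2**10 - 1 = 1023 combinations.
print(eval_runs(10, prompts_per_combo=20, trials=5))  # 102300
# Only single servers and pairs: 10 + 45 = 55 combinations.
print(eval_runs(10, prompts_per_combo=20, trials=5, max_combo_size=2))  # 5500
```

Capping the test matrix at pairs keeps the run count manageable, at the price of missing interactions that only appear with three or more servers loaded. Each run is a paid model call, which is where the cost concern comes from.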