Hacker News
by mlenhard 403 days ago
Agree on the unpredictability of results issue. Tool call selection is still sort of a black box.

How do you know which variations of a prompt trigger a given tool call, or how many tools is too many before you start seeing degradation because of the context window? If you are building a client rather than a server, the issue becomes even more pronounced.

I even extracted the Claude Electron source to see if I could figure out how they were doing it, but it's abstracted behind a network request. I'm guessing the system prompt handles tool call selection.

PS: I released an open-source evals package, if you're curious. It's still a WIP, but it does the basics: https://github.com/mclenhard/mcp-evals
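
For the "which prompt variations trigger a given tool" question above, a minimal sketch of the measurement might look like the following. It is not the API of the package linked above. The tool catalog, the prompts, and `keyword_selector` are all hypothetical, and the selector is a toy word-overlap stand-in for a real model call, so only the harness shape carries over.

```python
from collections import Counter
from typing import Callable, Dict, List

# Hypothetical tool catalog (name -> description), not from any real MCP server.
TOOLS: Dict[str, str] = {
    "read_file": "Read the full contents of one file by path",
    "search_files": "Search file contents in the workspace for a text pattern",
    "run_tests": "Run the project's test suite and report failures",
}

def keyword_selector(prompt: str, tools: Dict[str, str]) -> str:
    """Toy stand-in for a model call: pick the tool whose description
    shares the most words with the prompt. Swap in a real client call here."""
    words = set(prompt.lower().split())
    return max(tools, key=lambda name: len(words & set(tools[name].lower().split())))

def selection_rates(
    variants: List[str],
    expected: str,
    select: Callable[[str, Dict[str, str]], str],
    tools: Dict[str, str],
    runs: int = 5,
) -> Dict[str, float]:
    """For each prompt variant, the fraction of runs that chose `expected`."""
    rates = {}
    for prompt in variants:
        picks = Counter(select(prompt, tools) for _ in range(runs))
        rates[prompt] = picks[expected] / runs
    return rates

variants = [
    "search the workspace for a text pattern",
    "find where this function is used",
    "grep for TODO",
]
rates = selection_rates(variants, "search_files", keyword_selector, TOOLS)
for prompt, rate in rates.items():
    print(f"{rate:.0%}  {prompt}")
```

Re-running the same function with a larger `tools` dict is one way to look at the "how many tools is too many" question, since the expected tool and the prompts stay fixed while the catalog grows.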

Thanks, I'll check it out.

I'm working on a coding agent, and MCP has been a frequently requested feature, but yeah this issue has been my main hesitation.

Getting even basic prompts that are designed to do one or two things to work reliably takes so much testing and iteration that I'm pretty skeptical "here are 10 community-contributed MCPs, choose the right one for the task" has any hope of working reliably. Of course, the benefits if it did work are very clear, so I'm keeping a close watch on it. Evals seem like a key piece of the puzzle, though you might still end up in combinatorial explosion territory if you try to test all the potential interactions between multiple MCPs. I could also see it getting very expensive to test this way.
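
To put rough numbers on the combinatorial concern, here is a back-of-the-envelope sketch. The inputs (10 servers, 20 test prompts, 10 runs per prompt) are assumptions picked for illustration, not measurements.

```python
from math import comb

def eval_calls(n_servers: int, prompts: int, runs_per_prompt: int) -> dict:
    """Model calls needed to cover server combinations of different sizes."""
    per_config = prompts * runs_per_prompt  # calls for one fixed set of servers
    return {
        "each server alone": n_servers * per_config,
        "every pair": comb(n_servers, 2) * per_config,
        "every non-empty subset": (2 ** n_servers - 1) * per_config,
    }

# Assumed numbers for illustration: 10 servers, 20 test prompts, 10 runs each.
for label, calls in eval_calls(10, 20, 10).items():
    print(f"{label:>24}: {calls:,} calls")
```

Under those assumptions that comes to 2,000 calls for servers in isolation, 9,000 for every pair, and 204,600 for every non-empty subset, before multiplying by whatever a single call costs.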

I actually came across Plandex the other day. I haven't had the chance to play around with it yet, but it looked really cool.

But agree that even basic prompts can be a struggle. You often need to name the tool in the prompt to get things to work reliably, but that's an awful user experience. Tool call descriptions play a pretty vital role, but most MCP servers are severely lacking in this regard.

I hope this is a result of everything being so new, and that the tooling and models will evolve to solve these issues over time.

Yeah, I'm still wondering if MCP will be the solution that sticks in the long run.

It has momentum and clearly a lot of folks are working on these shortcomings, so I could certainly see it becoming the de facto standard. But the issues we're talking about are pretty major ones that might need a more fundamental reimagining to address. Although it could also theoretically all be resolved by the models improving sufficiently, so who knows.

Also, cool to hear that you came across Plandex. Lmk what you think if you try it out!

Yes, I agree with you. What are the major shortcomings that you can think of right now (especially the ones that we have not solved)?

I think the two biggest issues are probably:

1. Giving the model too many choices. If you have a lot of options (like a bunch of MCP servers), what you often see in practice is that which option gets chosen is a dice roll, even if the best choice is pretty obvious to a human. This is tough even when you just have a single branch in the prompt where the model has to choose path A or B. It's hard to get it to choose intelligently rather than randomly.

2. Global scope. The prompts related to each MCP all get mixed together in the system prompt, along with the prompting for the tool that's integrating them. They can easily modify each other's behavior in unpredictable ways.
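
As a small illustration of the global scope point, here is a sketch that assumes a client which simply concatenates every server's tool descriptions into one prompt section. That is an assumption for the sake of the example, not a description of any particular client, and the servers and tools are invented.

```python
from typing import Dict, List, Tuple

# Hypothetical servers and tools, invented for illustration.
SERVERS: Dict[str, Dict[str, str]] = {
    "github": {
        "search": "Search issues and pull requests",
        "create_issue": "Open a new issue",
    },
    "docs": {
        "search": "Search the documentation site",
        "fetch_page": "Fetch one page by URL",
    },
}

def flatten(servers: Dict[str, Dict[str, str]]) -> Tuple[str, Dict[str, List[str]]]:
    """Merge all tool descriptions into one flat prompt section and
    report tool names that more than one server defines."""
    lines: List[str] = []
    owners: Dict[str, List[str]] = {}
    for server, tools in servers.items():
        for name, desc in tools.items():
            lines.append(f"- {name}: {desc}")  # server identity is lost here
            owners.setdefault(name, []).append(server)
    collisions = {n: s for n, s in owners.items() if len(s) > 1}
    return "\n".join(lines), collisions

prompt_section, collisions = flatten(SERVERS)
print(prompt_section)
print("ambiguous names:", collisions)
```

Name collisions are only the visible case. Overlapping or contradictory wording in the descriptions lands in the same flat text and is much harder to detect mechanically.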

Makes sense. Both are hard problems, I agree.

Even with proper tool call descriptions, I've had quite a few occasions where the LLM didn't know how to use the tool.

The tools provided by the MCP server were definitely in context, and there were only two or three servers with a small number of tools enabled.

It feels too model-dependent at the moment. This was Gemini 2.5 Pro, which is normally state of the art but seems to have a lot of quirks around tool use.

Agreed on hoping models are going to be trained to be better at using MCP.

Right, my workflow to get even a basic prompt working consistently rarely involves fewer than about 10 cycles of [run it 10 times -> update the prompt extensively to knock out the problems seen in the first step].

And then every time I try to add something new to the prompt, all the prompting for previously existing behavior often needs to be updated as well to account for the new stuff, even if it's in a totally separate 'branch' of the prompt flow/logic.

I'd anticipate that each individual MCP I wanted to add would require a similar process to ensure reliability.
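
The cycle described above (run each case several times, edit the prompt, then recheck everything that used to work) can be sketched as a small regression harness. Everything here is hypothetical: `old_prompt` and `new_prompt` are deterministic stand-ins for running a model with two versions of a prompt, and the cases and checks are invented.

```python
from itertools import count
from typing import Callable, Dict, List, Tuple

Case = Tuple[str, Callable[[str], bool]]  # (input text, check on the output)

def pass_rates(run: Callable[[str], str], cases: List[Case], runs: int = 10) -> Dict[str, float]:
    """Run every case `runs` times and return the fraction of passing runs."""
    rates = {}
    for text, check in cases:
        passed = sum(check(run(text)) for _ in range(runs))
        rates[text] = passed / runs
    return rates

def regressions(before: Dict[str, float], after: Dict[str, float], tolerance: float = 0.1) -> List[str]:
    """Cases whose pass rate dropped by more than `tolerance` after a prompt change."""
    return [c for c in before if before[c] - after.get(c, 0.0) > tolerance]

def old_prompt(text: str) -> str:
    return text.upper()  # stand-in: the old prompt always behaves

_calls = count()

def new_prompt(text: str) -> str:
    # Stand-in for an edit that made a separate branch flaky:
    # "rename" inputs now fail on every other call.
    if text.startswith("rename") and next(_calls) % 2 == 0:
        return text.lower()
    return text.upper()

cases: List[Case] = [("rename foo", str.isupper), ("add tests", str.isupper)]
before = pass_rates(old_prompt, cases)
after = pass_rates(new_prompt, cases)
print("regressed:", regressions(before, after))
```

Comparing pass rates rather than single runs is the point of the design: a branch that drops from 10/10 to 5/10 would look fine about half the time if it were only run once.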