| HN Mirror


by zoogeny 430 days ago

> But that overhead is gonna be fixed-cost, and paid one-time at e.g. session initialization, not every time per e.g. request/response.

I don't work for a foundation model provider, but how do you think the tool definitions get into the LLM? I mean, they aren't fine-tuning a model with your specific tool definitions, right? You're just using OpenAI's base model (or Claude, Gemini, etc.). So at some point the tool definitions have to be added to the prompt. They are just getting added to the prompt auto-magically by the foundation provider. That means they are eating up some context window, just a portion of the context window that is normally reserved for the provider, a section of the final prompt that you don't get to see (or alter).

Again, while I don't work for these companies or implement these features, I cannot fathom how the feature could work unless it was added to every request. And so the original point of the thread author stands.

1 comments

kiitos 430 days ago

You're totally right, in that: whatever MCP servers your client is configured to know about have a set of capabilities, each of which has some kind of definition, all of which need to be provided to the LLM, somehow, in order to be usable.

And you're totally right that the LLM is usually general-purpose, so the MCP details aren't trained or baked-in, and need to be provided by the client. And those details are probably gonna eat up some tokens for sure. But they don't necessarily need to be included with every request!

Interactions with LLMs aren't stateless request/response, they're session-based. And you generally send over metadata like what we're discussing here, or user-defined preferences/memory, etc., as part of session initialization. This stuff isn't really part of a "prompt", at least as that concept is commonly understood.


zoogeny 430 days ago

I think we are using the word "prompt" in two different ways here, which is leading to miscommunication.

There is the prompt that I, as a user, send to OpenAI, which then gets used. Then there is the "prompt" which is being sent to the LLM. I don't know how these things are talked about internally at the company. But they take the "prompt" you send them and add a bunch of extra stuff to it. For example, they add in their own system message and they will add your system message. So you end up with something like <OpenAI system message> + <User system message> + <user prompt>. That creates a "final prompt" that gets sent to the LLM. I'm sure we both agree on that.

With MCP, we are also adding in <tool description> to that final prompt. Again, it seems we are agreed on that.

So the final piece of the argument is that the "final prompt" (or whatever the correct term is) keeps growing. Its size is the size of the provider system prompt, plus the size of the user system prompt, plus the size of the tool descriptions, plus the size of the actual user prompt. You have to pay that "final prompt" cost for each and every request you make.

If the size of the "final prompt" affects the performance of the LLM, such that very large "final prompt" sizes adversely affect performance, then it stands to reason that adding many tool definitions to a request will eventually degrade the LLM performance.
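To make the size argument concrete, here is a rough sketch of how the pieces might add up. Everything in it is an assumption for illustration: `build_final_prompt`, `rough_token_count`, the strings, and the 50 made-up tool definitions are hypothetical, and a whitespace word count stands in for a real tokenizer. It is not any provider's actual prompt format.

```python
# Hypothetical sketch: assembling a "final prompt" from its parts.
# None of these names or formats come from a real provider.
import json


def build_final_prompt(provider_system, user_system, tools, user_prompt):
    """Concatenate every piece that ends up in the model's context window."""
    tool_block = "\n".join(json.dumps(t, sort_keys=True) for t in tools)
    return "\n\n".join([provider_system, user_system, tool_block, user_prompt])


def rough_token_count(text):
    """Crude whitespace word count, standing in for a real tokenizer."""
    return len(text.split())


# 50 made-up tool definitions, roughly the shape a tool schema might take.
tools = [
    {"name": f"tool_{i}", "description": "does something useful", "parameters": {"q": "string"}}
    for i in range(50)
]

without_tools = build_final_prompt("provider rules", "user rules", [], "What is 2 + 2?")
with_tools = build_final_prompt("provider rules", "user rules", tools, "What is 2 + 2?")

# The user prompt is identical, but the second final prompt is much larger.
print(rough_token_count(without_tools), rough_token_count(with_tools))
```

The user's own text is the same in both cases; only the tool block differs, and under this model it is paid for on every request.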


kiitos 430 days ago

> With MCP, we are also adding in <tool description> to that final prompt. Again, it seems we are agreed on that.

Interactions with an LLM are session-based. When you create a session, some information is sent over _once_ as part of that session construction, and that information applies to all interactions made via that session. That initial data includes contextual information, like user preferences, model configuration as specified by your client, and MCP server definitions. When you type some stuff and hit enter, that is a user prompt that may get hydrated with some additional stuff before it gets sent out, but it doesn't include any of that initial data provided at the start of the session.


noodletheworld 430 days ago

> that information applies to all interactions made via that session

Hmm... maybe you should run a llama.cpp server in debug mode and review the content that goes to the actual LLM; you can do that with the verbose flag or `OLLAMA_DEBUG=1` (if you use ollama).

What you are describing is not how it works.

There is no such thing as an LLM 'session'.

That is a higher-level abstraction that sits on top of an API. It just means some server is caching part of your prompt, taking the fragment you typed in the UI, and combining them on the server side before feeding them to the LLM.

It makes no difference how it is implemented technically.

Fundamentally, any request you make which can invoke tools will be transformed, at some point, into a prompt that includes the tool definitions before it is passed to the LLM.
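As a minimal sketch of that idea, assuming a toy setup: the `Session` class, `fake_llm`, and the tool strings below are all hypothetical and do not reflect llama.cpp, ollama, or any provider's API. The point is only that a "session" can cache the setup data while the model still receives it on every call.

```python
# Hypothetical illustration: a "session" as a wrapper over a stateless call.
# fake_llm stands in for the model; it only sees what is passed in each time.


def fake_llm(full_prompt):
    """Stand-in for the model: reports how many characters it received."""
    return f"(model saw {len(full_prompt)} chars)"


class Session:
    def __init__(self, system_message, tool_definitions):
        # "Sent once" from the client's point of view, then cached server side.
        self.prefix = system_message + "\n" + "\n".join(tool_definitions)
        # Record of every full prompt the model actually received.
        self.sent_to_model = []

    def send(self, user_text):
        # The cached prefix is re-attached on every single call.
        full_prompt = self.prefix + "\n" + user_text
        self.sent_to_model.append(full_prompt)
        return fake_llm(full_prompt)


session = Session("You are helpful.", ["tool: search(query)", "tool: read_file(path)"])
session.send("hello")
session.send("what is in notes.txt?")

# Both prompts that reached the model contain the tool definitions.
for prompt in session.sent_to_model:
    print("tool: search(query)" in prompt)
```

From the client's side the tool definitions were provided once; from the model's side they arrive with every prompt.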

That has a specific, measurable cost on LLM performance as the number of tool definitions goes up.

The only solution to that is to limit the number of tools you have enabled; which is entirely possible and reasonable to do, by the way.

My point is that adding more and more and more tools doesn't scale and doesn't work.

It only works when you have a few tools.

If you have 50 MCP servers enabled, your requests are probably degraded.


kagevf 430 days ago

> There is no such thing as an LLM 'session'.

This matches my understanding too, at least of how it works with OpenAI. To me, that would explain why there's a 20 or 30 question limit for a conversation, because the context that needs to be sent with each request would necessarily grow larger and larger.
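A back-of-the-envelope sketch of that growth, assuming the full history is re-sent with every request. The function name and all numbers (2000 tokens of fixed overhead, 300 tokens per turn, 30 turns) are invented for illustration and are not measurements of any real service.

```python
# Hypothetical arithmetic: per-request size when the whole history is re-sent.


def tokens_sent_per_request(fixed_overhead, tokens_per_turn, num_turns):
    """Tokens sent on each request if the full history is included every time."""
    return [fixed_overhead + tokens_per_turn * turn for turn in range(1, num_turns + 1)]


# Made-up numbers: system prompt plus tool definitions = 2000 tokens,
# each question/answer turn adds 300 tokens, conversation lasts 30 turns.
per_request = tokens_sent_per_request(fixed_overhead=2000, tokens_per_turn=300, num_turns=30)
total = sum(per_request)

print(per_request[0])   # 2300 tokens on the first request
print(per_request[-1])  # 11000 tokens on the thirtieth request
print(total)            # 199500 tokens sent across the whole conversation
```

Under these assumptions the per-request size grows linearly with the number of turns, so the total sent over a conversation grows roughly quadratically.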
