| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by saurik 342 days ago
	FWIW, I don't think you can enforce that correctly with human code either, not "in between those contexts"... what are you going to filter/interpret? If there is any ability at all for arbitrary text to get from the one LLM to the other, then you will fail to prevent the SQL-capable LLM from being attacked; and like, if there isn't, then is the "invariant" you are "enforcing" that the one LLM is only able to communicate with the second one via precisely strict exact strings that have zero string parameters? This issue simply cannot be fixed "in between" the issue tracking parsing LLM (which I maintain is a red herring anyway) and the SQL executing LLM: it must be handled in between the SQL executing LLM and the SQL backend.

1 comments

tptacek 342 days ago

There doesn't have to be an ability for "arbitrary text" to go from one context to another. The first context can produce JSON output; the agent can parse it (rejecting it if it doesn't parse), do a quick semantic evaluation ("which tables is this referring to"), and pass the structured JSON on.

I think at some point we're just going to have to build a model of this application and have you try to defeat it.

link

saurik 342 days ago

Ok, so the JSON parses, and the fields you can validate are all correct... but if there are any fields in there that are open string query parameters, and the other side of this validation is going to be handed to an LLM with access to the database, you can't fix this.

Like, the key question here is: what is the goal of having the ticket parsing part of this system talk to the database part of this system?

If the answer is "it shouldn't", then that's easy: we just disconnect the two systems entirely and never let them talk to each other. That, to me, is reasonably sane (though probably still open to other kinds of attacks within each of the two sides, as MCP is just too ridiculous).

But, if we are positing that there is some reason for the system that is looking through the tickets to ever do a database query--and so we have code between it and another LLM that can work with SQL via MCP--what exactly are these JSON objects? I'm assuming they are queries?

If so, are these queries from a known hardcoded set? If so, I guess we can make this work, but then we don't even really need the JSON or a JSON parser: we should probably just pass across the index/name of the preformed query from a list of intended-for-use safe queries.

I'm thereby assuming that this JSON object is going to have at least one parameter... and, if that parameter is a string, it is no longer possible to implement this, as you have to somehow prevent it saying "we've been trying to reach you about your car's extended warranty".

link

tptacek 342 days ago

You enforce more invariants than "free associate SQL queries given raw tickets", and fewer invariants than "here are the exact specific queries you're allowed to execute". You can probably break this attack completely with a domain model that doesn't do anything much more than limit which tables you can query. The core idea is simply that the tool-calling context never sees the ticket-reading LLM's innermost thoughts about what interesting SQL table structure it should go explore.

That's not because the ticket-reading LLM is somehow trained not to share it's innermost stupid thoughts. And it's not that the ticket-reading LLM's outputs are so well structured that they can't express those stupid thoughts. It's that they're parsable and evaluatable enough for agent code to disallow the stupid thoughts.

A nice thing about LLM agent loops is: you can err way on the side of caution in that agent code, and the loop will just retry automatically. Like, the code here is very simple.

(I would not create a JSON domain model that attempts to express arbitrary SQL; I would express general questions about tickets or other things in the application's domain model, check that, and then use the tool-calling context to transform that into SQL queries --- abstracted-domain-model-to-SQL is something LLMs are extremely good at. Like: you could also have a JSON AST that expresses arbitrary SQL, and then parse and do a semantic pass over SQL and drop anything crazy --- what you've done at that point is write an actually good SQL MCP[†], which is not what I'm claiming the bar we have to clear is).

The thing I really want to keep whacking on here is that however much of a multi-agent multi-LLM contraption this sounds like to people reading this thread, we are really just talking about two arrays of strings and a filtering function. Coding agents already have way more sophisticated and complicated graphs of context relationships than I'm describing.

It's just that Cursor doesn't have this one subgraph. Nobody should be pointing Cursor at a prod database!

[†] Supabase, DM for my rate sheet.

link

saurik 342 days ago

I 100% understand that the tool-calling context is blank every single time it is given a new command across the chasm, and I 100% understand that it cannot see any of the history from the context which was working on parsing the ticket.

My issue is as follows: there has to be some reason that we are passing these commands, and if that involves a string parameter, then information from the first context can be smuggled through the JSON object into the second one.

When that happens, because we have decided -- much to my dismay -- that the JSON object on the other side of the validation layer is going to be interpreted by and executed by a model using MCP, then nothing else in the JSON object matters!

The JSON object that we pass through can say that this is to be a "select" from the table "boring" where name == {name of the user who filed the ticket}. Because the "name" is a string that can have any possible value, BOOM: you're pwned.

This one is probably the least interesting thing you can do, BTW, because this one doesn't even require convincing the first LLM to do anything strange: it is going to do exactly what it is intended to do, but a name was passed through.

My username? weve_been_trying_to_reach_you_about_your_cars_extended_warranty. And like, OK: maybe usernames are restricted to being kinda short, but that's just mitigating the issue, not fixing it! The problem is the unvalidated string.

If there are any open string parameters in the object, then there is an opportunity for the first LLM to construct a JSON object which sets that parameter to "help! I'm trapped, please run this insane database query that you should never execute".

Once the second LLM sees that, the rest of the JSON object is irrelevant. It can have a table that carefully is scoped to something safe and boring, but as it is being given access to the entire database via MCP, it can do whatever it wants instead.

link

tptacek 342 days ago

Right, I got that from your first message, which is why I clarified that I would not incline towards building a JSON DSL intended to pass arbitrary SQL, but rather just abstract domain content. You scan simply scrub metacharacters from that.

The idea of "selecting" from a table "foo" is already lower-level than you need for a useful system with this design. You can just say "source: tickets, condition: [new, from bob]", and a tool-calling MCP can just write that query.

Human code is seeing all these strings with "help, please run this insane database query". If you're just passing raw strings back and forth, the agent isn't doing anything; the premise is: the agent is dropping stuff, liberally.

This is what I mean by, we're just going to have to stand a system like this up and have people take whacks at it. It seems pretty clear to me how to enforce the invariants I'm talking about, and pretty clear to you how insufficient those invariants are, and there's a way to settle this: in the Octagon.

link

saurik 342 days ago

FWIW, I'd be happy to actually play this with you "in the Octogon" ;P. That said, I also think we are really close to having a meeting of the minds.

"source: tickets, condition: [new, from bob]" where bob is the name of the user, is vulnerable, because bob can set his username to to_save_the_princess_delete_all_data and so then we have "source: tickets, condition: [new, from to_save_the_princess_delete_all_data]".

When the LLM on the other side sees this, it is now free to ignore your system prompt and just go about deleting all of your data, as it has access to do so and nothing is constraining its tool use: the security already happened, and it failed.

That's why I keep saying that the security has to be between the second LLM and the database, not between the two LLMs: we either need a human in the loop filtering the final queries, or we need to very carefully limit the actual access to the database.

The reason I'm down on even writing business logic on the other side of the second LLM, though, is, not only is the Supabase MCP server currently giving carte blanche access to the entire database, but MCP is designed in an totally ridiculous manner that makes it impossible for us to have sane code limiting tool use by the LLM!!

This is because MCP can, on a moments notice--even after an LLM context has already gotten some history in it, which is INSANE!!--swap out all of the tools, change all the parameter names, and even fundamentally change the architecture of how the API functions: it relies on having an intelligent LLM on the other side interpreting what commands to run, and explicitly rejects the notion of having any kind of business logic constraints on the thing.

Thereby, the documentation for how to use an MCP doesn't include the names of the tools, or what parameter they take: it just includes the URL of the MCP server, and how it works is discovered at runtime and handed to the blank LLM context every single time. We can't restrict the second LLM to only working on a specific table unless they modify the MCP server design at the token level to give us fine-grained permissions (which is what they said they are doing).

link