Hacker News new | ask | show | jobs
by magicalhippo 245 days ago
I've tried some of these prompt injection techniques, and simply asked a few local models (like Gemma 2) if they thought it was very likely a prompt injection attempt. They all managed to correctly flag my attempts.

I know LLama folks have a special Guard model for example, which I imagine is for such tasks.

So my ignorant questions are this:

Do these MCP endpoints not run such guard models, and if so why not?

If they do, how come they don't stop such blatant attacks that seemingly even an old local model like Gemma 2 can sniff out?

2 comments

hey there

Joey here from Archestra. Good question. I recently was evaluating what you mention, against the latest/"smartest" models from the big LLM providers, and I was able to trick all of them.

Take a look at https://www.archestra.ai/blog/what-is-a-prompt-injection which has all the details on how I did this.

Thanks. Interesting and scary such blatant attempts succeed. After all, all external data is evil, we all know that right?
external data is unavoidable for the properly functioning agent, so we have to learn to cook it
True, however this seems like such basic stuff. Download arbitrary text and inject it into your prompt?

Why on earth would you not consider that as a very dangerous operation that needs to be carefully managed? It's like parking your bike downtown hoping it wont be stolen. Like, at least use a zip tie or something.

That said, I agree with your post that this won't catch everything. So something else, like a quarantined LLM like you suggest is likely needed.

However I just didn't expect such blatant attacks to pass.

Most mcp endpoints don’t run any models, the main model decides which tools the ai agent should execute, and if the agent passes results back into context, that opens the door to prompt injections.

It’s really a cat-and-mouse game, where for each new model version, new jailbreaks and injections are found