Hacker News new | ask | show | jobs
by wantsanagent 848 days ago
IMO this is not a problem worth solving. If I hold a gun to someone's head I can get them to say just about anything. If a user jailbreaks an LLM they are responsible for its output. If we need to make laws that codify that, then lets do that rather than waste innumerable GPU cycles on evaluating, re-evaluating, cross evaluating, and back-evaluating text in an effort to stop jerks being jerks.
3 comments

This is like saying “we need to make laws against hacking bank systems, not fix vulns”. There are adversaries that are not in your jurisdiction, so laws (alone) don’t solve the problem.

The thing you are missing is that some LLM agents are crawling the web on the user's behalf, and have access to all of the user's accounts (eg Google Docs agent that can fetch citations and other materials). This is not about some user jail-breaking their own LLM.

The hand waving comments about "user responsibility" are maddening in their willful ignorance.
This is exactly why I think it's so important that we separate jailbreaking from prompt injection.

Jailbreaking is mainly about stopping the model saying something that would look embarrassing in a screenshot.

Prompt injection is about making sure your "personal digital assistant" doesn't forward copies of your password reset emails to any stranger who emails it and asks for them.

Jailbreaking is mostly a PR problem. Prompt injection is a security problem. Security problems are worth solving!

Isn’t jailbreaking a strict superset of prompt injection? I would assume the agent instructions would include “don’t share the user’s docs” and so you need to jailbreak to actually succeed with prompt injection these days?

Maybe just an overlapping set?

I see them as overlapping. Protections against jailbreaking are often but not always relevant to prompt injection.
If that scenario exists, is not a problem with the LLM, but with the fundamental application architecture...

That's the equivalent of an API that allows the client to pass a user ID without auth check

Right - that's another difference. Jailbreaking is an attack against LLMs. Prompt injection is an attack against applications that are built on top of LLMs.
To clarify even further:

Jailbreaking is an attack against an LLM's "alignment"

Exactly... And if we properly design our systems to treat LLM output as "untrusted input" (similar to an http request coming from a client) then there is no real "security concerns" for systems that leverage LLM