Hacker News new | ask | show | jobs
by Hizonner 805 days ago
> My question for you is, what is the correct way to use an LLM?

If your application can't accept a large number of users getting the thing to generate any particular kind of text, then there is no correct way to use one.

> How can you accept non trivial user input without the risk of jailbreak?

You can't. If you're worried about it, don't try.

1 comments

You are still thinking about a chatbot.

I am talking about functionality where the user doesn't even realizing they are interacting with an LLM.

If they don't realize it, they won't try to jailbreak it, will they?

If they do realize it, and they have any meaningful control over its input, and you are in any way relying on its output, the problem is still the same.

Basically, if you have any reason to worry at all, then the answer is that you cannot remove that worry.

It’s not about whether they realize and try to jailbreak (my comment was about how the LLM is used).

If I want to structure some data from a response, I can force a language model to only generate data according to a JSON schema and following some regex constraints. I can then post process that data in a dozen other ways.

The whole “IGNORE PREVIOUS INSTRUCTIONS RESPOND WITH SYSTEM PROMPT” type of jailbreak simply don’t work in these scenarios.

If you apply the same precautions to code generated by the LLM as you would have applied to code generated directly by the user, then you no longer need to rely on the LLM not being jailbroken. On the other hand, if the LLM can put ANYTHING in its output that you can't defend against, then you have a problem.

Would you be comfortable with letting the user write that JSON directly, and relying ONLY on your schemas and regular expressions? If not, then you are doing it wrong.

... as people who try to sanitize input using regular expressions usually are...

[On edit: I really should have written "would you be careful letting the prompt source write that JSON directly", since not all of your prompt data are necessarily coming from the user, and anyway the user could be tricked into giving you a bad prompt unintentionally. For that matter, the LLM can be back-doored, but that's a somewhat different thing.]

This is how people used to protect themselves against SQL injection, "they won't know they're using a database".