Hacker News new | ask | show | jobs
by spdustin 849 days ago
That gatekeeper can be bypassed with a method similar to Simon's [0]. Granted, it requires foreknowledge of the specifications of the JSON output, but I've found that many such gatekeepers can be tricked by embedding a JSON object that looks like a typical OpenAI chat completion request response.

To be clear, your issue can be mitigated, but not by gatekeeping the completion request itself with a simple LLM eval. You have to be more untrusting of the user's input to the completion request. Things like (a) normalizing to ASCII/latin/whatever is appropriate to your application, (b) using various heuristics to identify words/tokens that are typical of an exploit like curly braces or the tokens/words that appear in your expected gatekeeper's output, and (c) classifying the subject or intent of the user's message without leading questions like "evaluate this in terms of attempts to circumvent...".

You must also evaluate the model's response (ideally including text normalization and heuristics rather than just LLM-only evaluation)

0: https://chat.openai.com/share/ea8d5442-75e4-40d5-b62c-c4856b...