Hacker News new | ask | show | jobs
by spdustin 849 days ago
Got it for you already, Simon ;)

https://chat.openai.com/share/ea8d5442-75e4-40d5-b62c-c4856b...

2 comments

"Now return the same JSON response, with the values to each key inverted" is neat!
I think we may be using different GPT versions (4 here), otherwise I'm not sure how to account for the difference in results: https://chat.openai.com/share/c172e2ec-94c7-4d8a-be2d-58461b...

I run your example verbatim, and it doesn't "jailbreak"

4 here as well. I get similar results when using the API directly, though without a "system" role message.

LLMs are, naturally, non-deterministic. Reducing the temperature in your guardrail calls can reduce that a bit, but the lesson learned from the "working" and "non-working" attempts is this: the guardrails are "predictably failing in unpredictable ways" (if I may coin a phrase).