Hacker News new | ask | show | jobs
by jasonlfunk 1191 days ago
Is there a reason not to have another “unbroken” chat instance check the output for violations? It seems like a simple “does the following response violate your rules?” would stop most of these “jailbreaks”.
1 comments

Is there a reliable way to 'escape' input? How would you stop the second instance from also being jailbroken by the prompt that tripped up the first instance?
Build a different model architecture where the system prompt is a different head than the user prompt and is always equally weighted.

Maybe.