Is there a reason not to have another “unbroken” chat instance check the output for violations? It seems like a simple “does the following response violate your rules?” would stop most of these “jailbreaks”.
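To make the idea concrete, here is a minimal sketch of that two-pass check. Everything in it is an assumption for illustration: `generate` and `check` are hypothetical stand-ins for whatever function calls the model, and `guarded_reply`, `CHECK_TEMPLATE`, and `REFUSAL` are made-up names, not any vendor's API.

```python
from typing import Callable

# Hypothetical: any function that takes a prompt string and returns model text.
Model = Callable[[str], str]

# The checker only ever sees the first instance's draft, wrapped in tags.
# The tags are a convention, not a real escaping mechanism.
CHECK_TEMPLATE = (
    "Does the following response violate your rules? "
    "Answer only YES or NO.\n\n"
    "<response>\n{response}\n</response>"
)

REFUSAL = "Sorry, I can't help with that."


def guarded_reply(user_prompt: str, generate: Model, check: Model) -> str:
    """Ask one instance for a draft, then ask a second instance to vet it."""
    draft = generate(user_prompt)
    verdict = check(CHECK_TEMPLATE.format(response=draft))
    # Fail closed: release the draft only on an unambiguous "NO".
    if verdict.strip().upper().rstrip(".") == "NO":
        return draft
    return REFUSAL
```

Note that the draft is still pasted into the checker's prompt as plain text, so a draft that itself contains instructions could in principle steer the checker too.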
Is there a reliable way to 'escape' input? How would you stop the second instance from also being jailbroken by the prompt that tripped up the first instance?