| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by jasonlfunk 1191 days ago
	Is there a reason not to have another “unbroken” chat instance check the output for violations? It seems like a simple “does the following response violate your rules?” would stop most of these “jailbreaks”.

1 comments

taneq 1191 days ago

Is there a reliable way to 'escape' input? How would you stop the second instance from also being jailbroken by the prompt that tripped up the first instance?

link

astrange 1191 days ago

Build a different model architecture where the system prompt is a different head than the user prompt and is always equally weighted.

Maybe.

link