| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by trifurcate 1242 days ago
	> The next step in defense is to tune a smaller LLM model to detect when output contains substantial repetition of the instructions, even in encoded form, or when the prompt appears designed to elicit such an encoding. For which you would use a meta-attack to bypass the smaller LM or exfiltrate its prompt? :-)