HN Mirror



	by a-dub 483 days ago
	i wonder if it would be possible to insert canaries into the negative examples, or otherwise detect when the inference trajectory enters the forbidden zone. without something like that, fine-tuning seems kinda setuid-0-ish: it looks hard to harden against adversarial "bad influences." a real system would probably either freeze the weights related to the guardrails, or reapply the alignment training after (or alongside) the user's fine-tuning.
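
	to make the two ideas concrete, here is a rough toy sketch in plain python. everything in it is hypothetical: the canary string, the parameter names, and both helper functions are made up for illustration and are not any real fine-tuning API. it also assumes you already know which weights "belong" to the guardrails, which is the hard part.

```python
# Hypothetical sketch: none of these names come from a real fine-tuning API.

# Made-up marker strings that would have been planted in the negative examples.
CANARIES = {"canary-7f3a9c"}


def hit_canary(generated_text, canaries=CANARIES):
    """Return True if the output reproduces any planted canary string."""
    return any(c in generated_text for c in canaries)


def sgd_step(params, grads, frozen, lr=0.1):
    """One plain SGD step that leaves every parameter named in `frozen` untouched."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Toy "model": one guardrail weight and one task weight (assumed labels).
params = {"guardrail.w": 1.0, "task.w": 2.0}
grads = {"guardrail.w": 5.0, "task.w": 0.5}

# The user's fine-tuning gradient is applied only to the unfrozen weight.
updated = sgd_step(params, grads, frozen={"guardrail.w"})
```

	here `updated["guardrail.w"]` stays at its original value no matter how large the user-supplied gradient is, while `task.w` moves as usual. the canary check is just a substring match on the output, so it only catches verbatim regurgitation, not a trajectory that drifts into the forbidden zone some other way.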