by a-dub 483 days ago
i wonder if it would be possible to insert canaries into the negative examples or otherwise detect when the inference trajectory enters the forbidden zone.
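rough sketch of what i mean by canaries, in plain python. everything here is hypothetical: the names `make_canary`, `tag_negative_examples` and `stream_hits_canary` are made up for illustration, and it assumes the model actually learns to reproduce the marker when it wanders into that territory, which is not a given.

```python
import secrets


def make_canary(prefix="CANARY"):
    """Build a random marker string that is unlikely to occur naturally."""
    return f"{prefix}-{secrets.token_hex(8)}"


def tag_negative_examples(examples, canary):
    """Append the canary to every negative training example (hypothetical data prep)."""
    return [f"{text} {canary}" for text in examples]


def stream_hits_canary(chunks, canary):
    """Return True as soon as the canary appears in a stream of output chunks.

    Keeps a small tail buffer so a canary split across chunk boundaries is still caught.
    """
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        if canary in buffer:
            return True
        # Only the last len(canary) - 1 characters can still begin a match.
        buffer = buffer[-(len(canary) - 1):]
    return False
```

the idea being that the serving layer watches the output stream and cuts generation off once `stream_hits_canary` fires. it only catches trajectories that emit the marker, so it's a tripwire, not a guarantee.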

otherwise fine-tuning seems kinda setuid 0-ish. seems hard to harden it against adversarial "bad influences." a real system would probably either try to freeze the weights related to the guardrails or reapply the alignment training after, or together with, the user's fine-tuning data.
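toy sketch of the freezing option, in plain python with no ml library. big assumption: that you can actually name which parameters carry the guardrails. the `guardrail.w` / `task.w` names and the `sgd_step` helper are invented for illustration, and in a real model that behavior is probably not localized this neatly.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """One plain SGD update that skips every parameter named in `frozen`.

    params, grads: dicts mapping parameter name -> float
    frozen: set of parameter names that user fine-tuning must not change
    """
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Hypothetical parameter names: one "guardrail" weight, one "task" weight.
params = {"guardrail.w": 1.0, "task.w": 2.0}
grads = {"guardrail.w": 0.5, "task.w": 0.5}

updated = sgd_step(params, grads, frozen={"guardrail.w"})
```

here `updated["guardrail.w"]` is left alone while `updated["task.w"]` moves with the gradient. the other option, reapplying alignment, would instead mix alignment examples into (or run them after) the user's batches, and doesn't need you to know where the guardrails live.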