|
|
|
|
|
by a-dub
483 days ago
|
|
i wonder if it would be possible to insert canaries into the negative examples or otherwise detect when the inference trajectory enters the forbidden zone. otherwise fine tuning seems kinda setuid 0-ish. seems hard to harden it against adversarial "bad influences." a real system would probably either try to freeze weights related to the guardrails or reapply the alignment stuff after or with the user stuff. |
|