by RoboTeddy
892 days ago
It was certainly an unresolved question before they did this work! Naively, it seems reasonable to believe that if you adjust all the weights of a neural net towards the behavior you want via SFT and RLHF, the new training would compete with, mute, or obscure undesired behavior like a backdoor. But it seems not to be so… Indeed, the cute mask does not cover the entire shoggoth; it may still have tentacles (https://images.app.goo.gl/YW9g3BvwGqGwYTgd6)