Hacker News
by RoboTeddy 892 days ago
It was certainly an unresolved question before they did this work!

Naively, it seems reasonable to believe that adjusting all the weights of a neural net toward the behavior you want via SFT and RLHF would compete with, mute, or obscure undesired behavior like a backdoor. But that seems not to be the case. Indeed, the cute mask does not cover the entire shoggoth; it may still have tentacles (https://images.app.goo.gl/YW9g3BvwGqGwYTgd6)