|
|
|
|
|
by ctoth
317 days ago
|
|
Can someone explain to me how "preventative steering" isn't an implementation of the most-forbidden technique? This sounds a lot like interpretability-guided training optimization, which I thought was a big big big no no. It will still introduce optimization pressure no? My understanding is that you shouldn't use insights gained from interpretability to feed back into your training process at risk of losing the interpretability in the first place. |
|
Because v is frozen, the optimiser still minimises the ordinary task loss; there’s no feedback loop that could re-encode the trait in some opaque basis. Empirically, Fig. 7B shows this keeps evil/sycophancy/hallucination near baseline while MMLU stays ~flat.
Caveats the authors themselves note: single-layer steering doesn’t always wipe the trait, so they try all-layer steering in App. J.3, which works better without hurting accuracy. They also tried a true regularization loss on the projection and found it did hide the signal elsewhere, i.e. the failure mode you’re worried about.
So it’s closer to “bias injection” than to “optimize on the probe,” which is why they argue it avoids the classic interpretability-collapse problem.