by Turn_Trout 331 days ago
As someone who did their PhD in RL and alignment, it was not obvious to me a priori whether, when, or how badly obfuscation would be a problem. Yes, it's been predicted (and was predicted significantly before that Zvi post). But many other alignment fears have also been _predicted_, and those didn't actually happen.

I don't think the existence of specification gaming in unrelated settings was strong evidence that obfuscation would occur in modern CoT supervision. Speculatively, I think CoT obfuscation happens because of the internal structure of LLMs: it is inductively "easier" to reweight model circuits so that they don't admit wrongthink than to rewire circuits to solve problems in entirely different ways.