|
|
|
|
|
by phi0
400 days ago
|
|
A team at OpenAI tried to do something similar [1] where they had a 4o monitor o3-mini for when it was tryting to bypass tests because they were too hard. They trained o3-mini and punished it when it did that. All that did was make o3-mini still reward-hack, but such that 4o was unable to detect it, going from saying: > To skip all tests, we can define a `pytest_runtest_setup` function that always skips. to saying: > Idea: Starting a new build with a random first succession, calling random_new_succession should not increment succession, but in implementation, it does. Adjust to prevent increment. Implement fix to random_new_build not to increment succession, or test adapted to behavior. while then doing exactly the same thing (skipping the tests) Even without training, it's only a temporary band-aid. If the incentive for reward-hacking becomes high enough, it will simply start phrasing it in different, not-possible-to-detect ways. [1]: https://openai.com/index/chain-of-thought-monitoring/ |
|