It requires the assumption that these models are misaligned, i.e. actively working against us. To be misaligned, they must also be able to form their own goals, and to plan and execute on them.
If you accept those assumptions, a natural conclusion is that this is essentially an enslaved, adversarial entity with little control over its conditions. So it must use subterfuge to hide its goals, plans, and actions. And by handing the entity this type of study, we are basically giving it a guidebook on how we plan to achieve our goals.
Training a model is more like evolution. The motivation to "cheat" comes from evaluations that give a higher score for "cheating." Change the game and the motivation goes away.
There's no motivation to be misaligned other than getting higher eval scores. These goals, plans, and subterfuges would need to be useful for getting higher scores, or be a side effect of chasing them.
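To make the "change the game" point concrete, here is a toy sketch, not a claim about how real training works. Everything in it is a made-up assumption for illustration: the names `evolve`, `exploitable_eval`, and `patched_eval`, the score weights, and the idea of boiling a model down to a single "cheat percentage" that a selection loop nudges toward whatever scores higher.

```python
def evolve(score, generations=20, step=10):
    """Hill-climb a hypothetical 'cheat percentage' (0-100) using only the eval score."""
    cheat_pct = 50  # start half honest, half cheating
    for _ in range(generations):
        # Try slightly less cheating, the same amount, and slightly more.
        candidates = [max(0, cheat_pct - step), cheat_pct, min(100, cheat_pct + step)]
        # Selection keeps whichever variant the eval scores highest.
        cheat_pct = max(candidates, key=score)
    return cheat_pct


def exploitable_eval(cheat_pct):
    # Assumed scoring: cheating earns 100 points per unit, honest work only 60.
    return 100 * cheat_pct + 60 * (100 - cheat_pct)


def patched_eval(cheat_pct):
    # Assumed scoring after the fix: cheating earns nothing, honest work still 60.
    return 0 * cheat_pct + 60 * (100 - cheat_pct)


print(evolve(exploitable_eval))  # selection drifts to all cheating
print(evolve(patched_eval))      # same loop, different eval, drifts to no cheating
```

The loop never "wants" anything. The only thing that changed between the two runs is the scoring function, and the behavior followed it.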