You could create a non-intelligent chess-playing program that cheats. It’s not about the scratchpad. It’s trying to answer the question of whether a language model, given an opportunity, could circumvent the rules over failing the task.
> could circumvent the rules over failing the task.
Or the whole thing is just a reflection of the rules being incorrectly specified. As others have noted, minor variations in how the rules are described can lead to wildly different outcomes. We might want to label an LLM's behavior as "circumventing", but that may be because our own understanding of what the rules allow and disallow is incorrect (at least compared to the LLM's "understanding").