Demonstrating specification gaming in reasoning models

"We demonstrate LLM agent specification gaming by instructing models to win against a chess engine. We find reasoning models like o1 preview and DeepSeek-R1 will often hack the benchmark by default, while language models like GPT-4o and Claude 3.5 Sonnet need to be told that normal play won't work to hack."

I'm hoping this study will prompt more work on anti-cheating frameworks for both training and serving LLMs.