It was a famously hard task. It was an ingenious idea for an unexpected task that falls outside of the bounds of predictable normal input but is still readily comprehended by the public.
Unfortunately, as soon as it's a famously hard task trainers know they need to succeed at it and it loses a lot of the power to detect correctness.
Maybe this is an example of training overfit. But it won't be too long before local models chew through the "famously hard tasks". Except possibly ARC-AGI. That's one benchmark that is still developing with capabilities. And every time a new ARC-AGI benchmark is released it make the SOTA LLMs look pathetic. Because there is very little understanding or transferability with LLMs. But in terms of benchmark-able micro tasks, the local LLMs are improving.
Unfortunately, as soon as it's a famously hard task trainers know they need to succeed at it and it loses a lot of the power to detect correctness.