No, and they haven't been for at least half a year. Utterly optimized for by the providers. Nowadays if a model would be SotA for general use but not #1 on any of these benchmarks, I doubt they'd even release it.
I've started keeping an eye out for original brainteasers, just for that reason. GCHQ's Christmas puzzle just came out [1], and o1-pro got 6 out of 7 of them right. It took about 20 minutes in total.
I wasn't going to bother trying those because I was pretty sure it wouldn't get any of them, but decided to give it an easy one (#4) and was impressed at the CoT.
Meanwhile, Google's newest 2.0 Flash model went 0 for 7.
I didn't see a straightforward way to submit the final problem, because I used different contexts for each of the 7 subproblems.
Given the right prompt, though, I'm sure it could handle the 'find the corresponding letter from the landmarks to form an anagram' part. That's easier than most of the other problems.
You're saying the ultimate answer isn't 'PROTECTING THE UNITED KINGDOM'?