I can't open all of the links in the article because of some Cloudflare issue (perhaps related to me being on a plane) but is the version and sub iteration of the model they actually used for this the same as the one they announced the capability on a year ago? If so, did they comment why didn't they just show this a year ago (they seem to have been publishing successively better results slowly instead).
In my memory, it was OpenAI o1. OpenAI o1 was released in September 2024. It seems the difference came from the difference between benchmark performance and propagation.