If you look at the "workflow" section of that page, they had to add a bunch of scaffolding around telling the model what moves are legal -- an LLM can't keep enough context to know how to play chess, only to choose an advantageous move from a given list. But feel free to "cherry pick".
I ran the benchmark without the valid-moves tool and without the three-mistakes grace, and gpt-5.4 still holds up well. It can reach 1000 Elo, which is much higher than my own.
This clearly tells me that GPT is good at chess, or at least better than a normal person who has played ~30-40 games in their life.