Their methodology shows they can create an infinite variety of problems.
This is the same thing as synthetic training data.
It doesn't matter if models are trained on the output of the generated data or not. If the model ends up being able to solve newly generated variations, you'd have to admit that it understands the underlying problems.
I think what it shows is that it has minimal "understanding" of the problem - otherwise such small variations wouldn't pose a challenge. Training it to handle these specific small variations doesn't change that.
If it were a complete failure on variations I would be inclined to agree. Instead it was a 30% drop in performance. I would characterise that as limited understanding.