Unfortunately, in my experience that's the case for most open-source AI projects out there: either the showcase results are hand-picked, or the algorithm was trained and tuned to solve that specific image.
I doubt this AI gets perfect performance even on the training set. Deep generative models are known to underfit rather than overfit, i.e. they can't even do a good job on the full training set, let alone the test set. The cherry-picked examples you see are just statistical outliers corresponding to VERY easy examples.
Excellent performance on a complex task by pure statistical chance is very unlikely. It's more likely that the examples are in-sample from a small training set, or very similar to something in it. Big NNs can memorize anything.
They can memorize any supervised learning task, but so far we haven't seen any deep generative model successfully memorize anything more complex than MNIST.
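
For what it's worth, the supervised half of that claim is easy to check on a toy problem. Below is a minimal sketch, not anyone's actual model: it assumes a one-hidden-layer network with a fixed random ReLU layer (a random-feature model, so only the output weights are fitted) and purely random labels. All names and sizes (`n_samples`, `n_hidden`, `W`, `w_out`) are made up for illustration. It says nothing about the generative case.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: far more hidden units than training samples.
n_samples, n_inputs, n_hidden = 50, 10, 500

# Random inputs with random +/-1 labels, so there is no signal to learn.
X = rng.normal(size=(n_samples, n_inputs))
y = rng.integers(0, 2, size=n_samples) * 2.0 - 1.0

# Fixed random hidden layer with ReLU; only the output weights are fitted.
W = rng.normal(size=(n_inputs, n_hidden))
H_train = np.maximum(X @ W, 0.0)

# Least-squares fit of the output weights (minimum-norm solution,
# since the system is underdetermined).
w_out, *_ = np.linalg.lstsq(H_train, y, rcond=None)

# Accuracy on the training set: the random labels are fitted exactly.
train_acc = float(np.mean(np.sign(H_train @ w_out) == y))

# Accuracy on fresh random data: nothing generalizable was learned,
# so this should hover around chance level.
X_new = rng.normal(size=(n_samples, n_inputs))
y_new = rng.integers(0, 2, size=n_samples) * 2.0 - 1.0
H_new = np.maximum(X_new @ W, 0.0)
test_acc = float(np.mean(np.sign(H_new @ w_out) == y_new))

print(f"train accuracy: {train_acc:.2f}, fresh-data accuracy: {test_acc:.2f}")
```

With more hidden units than samples the model can interpolate the labels, which is the sense of "memorize" here. A showcase picked from the training set of a model like this would look perfect while telling you nothing about new inputs.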