They do mention in the article that the missing-data test was done on "new" data that the models were not trained on, so it seems at least some of the results are not just regurgitation.
One way to test this kind of efficacy is to compare against a known sample with a missing piece: create an artifact with known text, destroy part of it in a similar fashion, and compare what the model suggests with the real known text.
The "known" sample would need to be handled and controlled by an independent trusted party, obviously, and therein lies the problem: it will be hard to set up an experiment properly, and to believe its results, if any of the parties has a vested interest in the success of the project.
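For what it's worth, the scoring half of that experiment is easy to sketch; the hard part is the trust issue above. A minimal version, assuming the model can be wrapped as a function that takes the damaged text and returns its guess for the missing span. All names here (`damage`, `score_restoration`, the `restore` callable) are hypothetical, and character error rate is just one reasonable metric I picked, not something taken from the article.

```python
from typing import Callable


def damage(text: str, start: int, length: int, gap: str = "?") -> str:
    """Replace `length` characters starting at `start` with a gap marker."""
    return text[:start] + gap * length + text[start + length:]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with the usual two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


def char_error_rate(predicted: str, truth: str) -> float:
    """Edit distance normalized by the length of the true text."""
    if not truth:
        raise ValueError("truth must be non-empty")
    return edit_distance(predicted, truth) / len(truth)


def score_restoration(
    restore: Callable[[str], str],  # hypothetical wrapper around the model
    text: str,
    start: int,
    length: int,
) -> float:
    """Damage a known text, ask the model for the gap, score it against the truth."""
    truth = text[start:start + length]
    guess = restore(damage(text, start, length))
    return char_error_rate(guess, truth)
```

A score of 0.0 means the model reproduced the destroyed span exactly; higher is worse. This only means anything if the known text was kept out of the training data, which the code cannot check for you.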
Someone deleted part of a known text.
This does require that the AI hasn't been trained on the test text previously.