What if you trained it on footage from the real world? I didn't see anything in the article that stated the AI needed any sort of feedback from interactive controls...
It doesn't need feedback from controls, but they have to be present as inputs during the learning process; otherwise you can't hook them up correctly. Learning real-world physics from video would be impressive in its own right, but on its own it's not enough to create a game. It's also not necessary, since we can already simulate most physical phenomena, and much more efficiently than anything a learning process is likely to produce at first.
In addition to the limitations stated in the sibling response, you would also need to solve the problem of image classification before you could start on real-world footage; the AI in the paper was given the video game sprites in addition to the raw footage.