Can you explain a bit more why the recurrent network structure becomes necessary at some point? Is that because reversing a CNN naturally means rendering by (de)convolution?
To approximately learn a "real" graphics engine with support for basic physics, feed-forward computation alone might not be sufficient. A more natural way to learn graphics and physics might be to model the temporal structure more explicitly, for example with a recurrent network. On the other hand, it might also be interesting to simply add a temporal convolution-deconvolution structure to the existing model. This is work in progress, though.
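To make the point about temporal structure concrete, here is a toy sketch in plain Python. It is not the model discussed above: the names `feed_forward_predict` and `RecurrentPredictor` are hypothetical, and the "physics" is hand-coded constant-velocity motion rather than anything learned. It only illustrates that a stateless map from one frame cannot recover velocity, while a predictor that carries state across frames can.

```python
# Toy illustration only: hypothetical names, hand-coded rules, nothing learned.

def feed_forward_predict(frame):
    """Stateless predictor: it sees only the current position.

    A single frame carries no velocity information, so the best it can do
    here is repeat its input as the guess for the next frame.
    """
    return frame


class RecurrentPredictor:
    """Stateful predictor: keeps the previous position as hidden state."""

    def __init__(self):
        self.prev = None  # hidden state carried between time steps

    def step(self, frame):
        # Estimate velocity from the stored state, then extrapolate linearly.
        velocity = 0.0 if self.prev is None else frame - self.prev
        self.prev = frame
        return frame + velocity


# A point moving at constant velocity, one position per frame.
positions = [0.0, 1.0, 2.0, 3.0, 4.0]

rnn = RecurrentPredictor()
ff_errors, rnn_errors = [], []
for t in range(len(positions) - 1):
    target = positions[t + 1]
    ff_errors.append(abs(feed_forward_predict(positions[t]) - target))
    rnn_errors.append(abs(rnn.step(positions[t]) - target))

print("feed-forward errors:", ff_errors)
print("recurrent errors:   ", rnn_errors)
```

In this toy setting the stateless predictor is off by one unit at every step, while the stateful one is wrong only on the first frame, before it has seen any motion. A temporal convolution over a short window of frames would get at the same information by a different route, which is roughly the alternative mentioned above.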