Now we need some kind of object recognition-enhanced version of style transfer that learns constraints on what "makes sense" given "sensible" labeled/captioned training examples!
It has been done with manual segmentation [1]. And the results are mind blowing. There also a lot of work done on segmentation with neural nets [2, 3], so I wouldn't be surprised to see someone implementing this idea in the near future.
1: https://github.com/alexjc/neural-doodle 2: https://arxiv.org/pdf/1605.06211.pdf 3: http://mi.eng.cam.ac.uk/projects/segnet/