It basically uses the feature activations of a pretrained VGG network as a latent space, which lets them synthesize images.
https://www.biorxiv.org/content/10.1101/787101v3
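
For anyone wondering what "synthesizing in a network's feature space" looks like mechanically, here's a rough sketch. It is not the paper's method, just the generic idea: freeze a feature extractor, pick target features, and run gradient descent on the input until its features match. Everything here is an assumption for illustration: a single fixed random layer with ReLU (`W`, `features`) stands in for the pretrained VGG, and a flat 16-dimensional vector stands in for the image.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a frozen pretrained network: one fixed random
# linear layer followed by ReLU. A real setup would use VGG activations.
W = rng.normal(size=(64, 16)) / np.sqrt(16)


def features(x):
    """Map an input vector to the frozen feature space."""
    return np.maximum(W @ x, 0.0)


def synthesize(target_feat, steps=500, lr=0.05):
    """Gradient descent on the input so its features approach target_feat."""
    x = rng.normal(size=W.shape[1]) * 0.1  # start from small random noise
    losses = []
    for _ in range(steps):
        pre = W @ x
        diff = np.maximum(pre, 0.0) - target_feat
        losses.append(0.5 * np.sum(diff ** 2))
        # Backprop through ReLU: only active units pass gradient to the input.
        grad = W.T @ (diff * (pre > 0))
        x -= lr * grad
    losses.append(0.5 * np.sum((features(x) - target_feat) ** 2))
    return x, losses


# Target features come from some reference input (random here, as a placeholder).
reference = rng.normal(size=16)
target = features(reference)
image, losses = synthesize(target)
```

The weights never change; only the input does. Swapping the random layer for real VGG activations and the vector for an image tensor would give the usual feature-matching setup, though the details in the linked paper may differ.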