by stellaathena 1403 days ago
Stable Diffusion produces substantially higher-quality images in most contexts, but the model itself is much more expensive to produce. The genius of VQGAN-CLIP is that it showed you could take two pre-existing models and combine them to get text-to-image synthesis to work at all. By contrast, models like DALL-E and Stable Diffusion require extremely expensive pretraining.

There's a discussion of this in the VQGAN-CLIP paper; see in particular Section 6.1, "Efficiency as a Value": https://arxiv.org/abs/2204.08583

Disclaimer: I'm one of the authors of the VQGAN-CLIP paper and was tangentially involved with Stable Diffusion.