by stellaathena
1403 days ago
Stable Diffusion produces substantially higher quality images in most contexts, but the model is much more expensive to produce. The genius of VQGAN-CLIP is that it showed you could take two pre-existing models and combine them to get text-to-image synthesis to work at all. By contrast, models like DALL-E and Stable Diffusion require extremely expensive pretraining. There's a discussion of this in the VQGAN-CLIP paper; see in particular Section 6.1, "Efficiency as a Value": https://arxiv.org/abs/2204.08583 Disclaimer: I'm one of the authors of the VQGAN-CLIP paper and was tangentially involved with Stable Diffusion.