by GaggiX 1253 days ago
Dalle 2 does not use any adversarial loss (so no GAN): it uses a text2image diffusion model and two diffusion-based upscalers. VQGAN is an autoencoder, and alone it can't do much. Dalle 1 works thx to the autoregressive model (also no GAN). Stable Diffusion uses an autoencoder because running a diffusion model directly on a 1024/768/512 pixel image is really inefficient, as the model has no bottleneck. The autoencoder does have an adversarial loss, but decoding a 64x64x4 latent up to a 512x512x3 image is a much simpler job than generating the 64x64x4 latent from scratch. That's why you need a diffusion or an autoregressive model as a base.
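To put rough numbers on the bottleneck, here is a quick sketch that only counts values in the two tensor shapes mentioned above. The helper name `num_values` is made up for illustration, and this says nothing about actual compute cost, which depends on the model architecture.

```python
def num_values(height, width, channels):
    # Number of scalar values in an image or latent tensor of this shape.
    return height * width * channels

pixel_values = num_values(512, 512, 3)   # full-resolution RGB image
latent_values = num_values(64, 64, 4)    # latent the diffusion model works on

compression = pixel_values / latent_values  # how many times fewer values
spatial_factor = 512 // 64                  # downsampling per side

print(pixel_values, latent_values, compression, spatial_factor)
# prints: 786432 16384 48.0 8
```

So the diffusion model has to produce about 48 times fewer values than it would in pixel space, and the autoencoder's decoder fills in the rest.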

Thanks for the corrections, I was including autoencoders that use an additional adversarial loss (such as VQGAN) when I said GAN.

> Dalle 1 works thx to the autoregressive model (also no GAN)

It uses an autoregressive model to predict codes for a pretrained VQGAN, doesn't it?

Doesn't Stable Diffusion's autoencoder also use an adversarial loss? Otherwise, wouldn't it suffer from the blurring that is typical of a pure MSE loss?

Yes, all the autoencoders you see used in practice have adversarial loss + MSE + perceptual loss. The VAE used with Stable Diffusion also uses KL regularization, while VQGAN has no KL term and instead uses additional losses to train and make use of its codebook.
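As a rough sketch of how those terms could be combined for the KL-regularized case: everything here is illustrative, not the actual Stable Diffusion training code. The weights `w_perc`, `w_adv`, and `w_kl` are hypothetical, `features` is a stand-in for a pretrained perceptual network, and the adversarial term assumes the generator simply tries to raise the discriminator's logits on reconstructions. Inputs are flat lists of floats to keep it dependency-free.

```python
import math

def mse(a, b):
    # Mean squared error between two equal-length lists.
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, exp(logvar)) || N(0, I) ) for a diagonal Gaussian.
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, logvar))

def autoencoder_loss(x, x_rec, mu, logvar, disc_logits_fake, features,
                     w_perc=1.0, w_adv=0.5, w_kl=1e-6):
    # Pixel-space reconstruction term.
    rec = mse(x, x_rec)
    # Perceptual term: distance in the feature space of `features`
    # (hypothetical callable standing in for a pretrained network).
    perc = mse(features(x), features(x_rec))
    # Adversarial term (assumed form): the autoencoder is rewarded when
    # the discriminator scores its reconstructions highly.
    adv = -sum(disc_logits_fake) / len(disc_logits_fake)
    # KL regularization of the latent distribution.
    kl = kl_to_standard_normal(mu, logvar)
    return rec + w_perc * perc + w_adv * adv + w_kl * kl
```

For a VQGAN-style model you would drop the KL term and add the codebook-related terms instead.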