Hacker News
by ShamelessC 1060 days ago
> Seeing as you can throw out diffusion altogether and synthesize images with transformers [3]

That’s actually how this whole party got started. DALL-E (the first one) was a transformer model trained on image tokens from an early VAE (and text tokens ofc). Researchers from CompVis developed VQGAN in response. OpenAI then showed improved fidelity with guided diffusion on class-conditional ImageNet, and subsequently with DALL-E 2, which used pixel-space diffusion and cascaded upsampling. CompVis responded with Latent Diffusion, which ran diffusion in the latent space of some new VQGANs.

The paper you mention is interesting! They go back to the DALL-E 1 method but train two VQGANs for upsampling and increase the parameter count. This is faster, but only compared with the originally reported diffusion benchmarks, which used inferior sampling methods. I would be curious whether they can beat some of the more recent samplers, which require as few as 10-20 steps.

They also improve on FID/CLIP scores, likely by using more parameters. This might be a memory/time trade-off though. I would be curious how much more VRAM their model requires compared to SD, MJ, or Kandinsky.

The same goes for using T5-XXL. You’ll win FID score contests but no one will be able to run it without an A100 or TPU pod.

1 comment

> The same goes for using T5-XXL

Is this still true in 2023? Sure, back in the dark ages it seemed like an 860M model was just about the limit for a regular consumer, but I don't see why we wouldn't be able to use quantized encoders; even 30B LLMs run okay on MacBooks now.
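
For a rough sense of scale, here is a back-of-the-envelope sketch of weight memory at different precisions. The helper name `weight_memory_gib` is made up for illustration, the ~11B figure for the full T5-XXL is an assumption (the encoder alone would be smaller), and this counts weights only, ignoring activations and runtime overhead.

```python
def weight_memory_gib(n_params: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in GiB.

    Ignores activations, caches, and framework overhead, so real
    usage will be higher than this estimate.
    """
    return n_params * bits_per_param / 8 / 2**30


# Parameter counts: 860M and 30B come from the comment above;
# 11B for the full T5-XXL is an assumed round figure.
MODELS = {
    "860M model": 0.86e9,
    "T5-XXL (assumed ~11B)": 11e9,
    "30B LLM": 30e9,
}

if __name__ == "__main__":
    for name, n_params in MODELS.items():
        # Compare fp16 weights against 8-bit and 4-bit quantization.
        sizes = ", ".join(
            f"{bits}-bit: {weight_memory_gib(n_params, bits):.1f} GiB"
            for bits in (16, 8, 4)
        )
        print(f"{name}: {sizes}")
```

Under these assumptions, quantizing to 8 or 4 bits cuts weight memory to a half or a quarter of fp16, which is the gap between needing a datacenter GPU and fitting on consumer hardware.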

That’s a fair point and I’m not sure actually. I bet you’re right though.