| Diffusion is more parameter-efficient, and you quickly saturate the target fidelity, especially with a refiner cascade. It's a solved problem. You do not need more than maybe 4B total. Images are far more redundant than text. In fact, most interesting papers since Imagen show that you get more mileage out of scaling the text encoder part, which is, of course, a Transformer. This is what drives accuracy, text rendering, compositionality, parsing edge cases. In SD 1.5 the text encoder part (CLIP ViT-L/14) takes a measly 123M parameters [1]. In Imagen, it was T5-XXL with 4.6B [2]. I am interested in someone trying to use a really strong encoder baseline – maybe from a UL2-20B – to push this tactic further. Seeing as you can throw out diffusion altogether and synthesize images with transformers [3], there is no reason to prioritize the diffusion part as such. [1] https://forums.fast.ai/t/stable-diffusion-parameter-budget-a... [2] https://arxiv.org/abs/2205.11487 [3] https://arxiv.org/abs/2301.00704 |
That’s actually how this whole party got started. DALL-E (the first one) was a transformer model trained on image tokens from an early VAE (and text tokens ofc). Researchers from CompVis developed VQGAN around the same time. OpenAI then showed improved fidelity with class-conditional guided diffusion on ImageNet, and subsequently DALL-E 2, using pixel-space diffusion and cascaded upsampling. CompVis responded with Latent Diffusion, which runs diffusion in the latent space of some new VQGANs.
The paper you mention is interesting! They go back to the DALL-E 1 method but train two VQGANs for upsampling and increase the parameter count. It is faster, but only compared with the originally reported diffusion benchmarks, which used inferior sampling methods. I would be curious whether they can beat some of the more recent samplers, which require as few as 10-20 steps.
They also improve on FID/CLIP scores, likely by using more parameters. This might be a memory/time trade-off though. I would be curious how much more VRAM their model requires compared to SD, MJ, Kandinsky.
The same goes for using T5-XXL. You’ll win FID score contests, but no one will be able to run it without an A100 or TPU pod.
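For a rough sense of the gap, here is a back-of-envelope sketch of the memory needed just to hold the two text encoders' weights. The parameter counts are the ones quoted upthread (123M and 4.6B); the helper name `weight_memory_gib` and the choice of fp16/fp32 are my own assumptions, and this ignores activations and everything else in the pipeline (UNet, upsamplers, VAE), which all come on top.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed to store the weights alone, in GiB (hypothetical helper)."""
    return num_params * bytes_per_param / 2**30


# Parameter counts as quoted in the thread, not measured here.
encoders = {
    "CLIP ViT-L/14 text encoder (SD 1.5)": 123e6,
    "T5-XXL encoder (Imagen)": 4.6e9,
}

for name, n_params in encoders.items():
    fp16 = weight_memory_gib(n_params, bytes_per_param=2)
    fp32 = weight_memory_gib(n_params, bytes_per_param=4)
    print(f"{name}: {fp16:.2f} GiB in fp16, {fp32:.2f} GiB in fp32")
```

That works out to roughly 0.23 GiB versus about 8.6 GiB in fp16 for the text encoder alone, before a single image latent is allocated.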