by airgapstopgap 1060 days ago
Diffusion is more parameter-efficient and you quickly saturate the target fidelity, especially with some refiner cascade. It's a solved problem. You do not need more than maybe 4B total. Images are far more redundant than text.

In fact, most interesting papers since Imagen show that you get more mileage out of scaling the text encoder, which is, of course, a Transformer. This is what drives accuracy, text rendering, compositionality, and the parsing of edge cases. In SD 1.5 the text encoder (CLIP ViT-L/14) takes a measly 123M parameters [1]. In Imagen, it was T5-XXL with 4.6B [2]. I'd be interested to see someone try a really strong encoder baseline, maybe from a UL2-20B, to push this tactic further.

Seeing as you can throw out diffusion altogether and synthesize images with transformers [3], there is no reason to prioritize the diffusion part as such.

1. https://forums.fast.ai/t/stable-diffusion-parameter-budget-a...

2. https://arxiv.org/abs/2205.11487

3. https://arxiv.org/abs/2301.00704

2 comments

> Seeing as you can throw out diffusion altogether and synthesize images with transformers [3]

That’s actually how this whole party got started. DALL-E (the first one) was a transformer model trained on image tokens from an early VAE (and text tokens ofc). Researchers from CompVis developed VQGAN in response. OpenAI showed improved fidelity with class-conditional guided diffusion on ImageNet, and subsequently released DALL-E 2, which uses pixel-space diffusion and cascaded upsampling. CompVis responded with Latent Diffusion, which runs diffusion in the latent space of some new VQGANs.

The paper you mention is interesting! They go back to the DALL-E 1 method but train two VQGANs for upsampling and increase the parameter count. This is faster, but only compared with the originally reported benchmarks, which used inferior sampling methods for the diffusion models. I would be curious whether they can beat some of the more recent samplers, which require as few as 10-20 steps.

They also improve on FID/CLIP scores, likely by using more parameters. This might be a memory/time trade-off though. I would be curious how much more VRAM their model requires compared to SD, MJ, or Kandinsky.

The same goes for using T5-XXL. You’ll win FID score contests but no one will be able to run it without an A100 or TPU pod.

> The same goes for using T5-XXL

Is this still true in 2023? Sure, back in the dark ages it seemed like an 860M model was just about the limit for a regular consumer, but I don't see why we wouldn't be able to use quantized encoders; even 30B LLMs run okay on MacBooks now.
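
For a rough sense of scale, here is a back-of-the-envelope sketch of the weight-only memory for the encoders mentioned in this thread at a few precisions. The parameter counts are the ones quoted above; the bytes-per-parameter values are nominal assumptions that ignore quantization overhead (scales, zero points) and activations, and the 20B figure is the nominal size of the whole UL2 model, used here only as an upper bound for a hypothetical encoder. The names `ENCODER_PARAMS`, `weight_gib`, and `footprint_table` are made up for illustration.

```python
# Weight-only memory estimate: params * bytes_per_param, reported in GiB.
# Assumption: nominal sizes only; real quantized checkpoints carry extra metadata.

ENCODER_PARAMS = {
    "CLIP ViT-L/14 text encoder (SD 1.5)": 123e6,  # figure quoted in the thread
    "T5-XXL encoder (Imagen)": 4.6e9,              # figure quoted in the thread
    "UL2-20B (upper bound, hypothetical)": 20e9,   # whole-model nominal size
}

BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gib(num_params: float, precision: str) -> float:
    """Return the memory needed for the weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[precision] / 2**30


def footprint_table() -> dict:
    """Map each encoder name to {precision: GiB rounded to 2 decimals}."""
    return {
        name: {p: round(weight_gib(n, p), 2) for p in BYTES_PER_PARAM}
        for name, n in ENCODER_PARAMS.items()
    }


if __name__ == "__main__":
    for name, row in footprint_table().items():
        cells = ", ".join(f"{p}: {gib:.2f} GiB" for p, gib in row.items())
        print(f"{name}: {cells}")
```

By this estimate the T5-XXL encoder weights come to roughly 8.6 GiB at fp16 and roughly 2.1 GiB at 4 bits, before counting the diffusion model itself or any activations.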

That’s a fair point and I’m not sure actually. I bet you’re right though.

> Images are far more redundant than text.

"A picture is worth a thousand words" - I wonder how (in)accurate this popular saying turned out to be? :D

I'm gonna go ahead and say in 2023, one detailed picture (512x512) is worth about 30 words.

I guess that depends on the prompt.

Do negative prompt tokens count as words?