
by airgapstopgap 1060 days ago

Diffusion is more parameter-efficient and you quickly saturate the target fidelity, especially with some refiner cascade. It's a solved problem. You do not need more than maybe 4B total. Images are far more redundant than text.

In fact, most interesting papers since Imagen show that you get more mileage out of scaling the text encoder part, which is, of course, a Transformer. This is what drives accuracy, text rendering, compositionality, parsing edge cases. In SD 1.5 the text encoder part (CLIP ViT-L/14) takes a measly 123M parameters.[1] In Imagen, it was T5-XXL with 4.6B.[2] I am interested in someone trying to use a really strong encoder baseline – maybe from UL2-20B – to push this tactic further.

Seeing as you can throw out diffusion altogether and synthesize images with transformers [3], there is no reason to prioritize the diffusion part as such.

1. https://forums.fast.ai/t/stable-diffusion-parameter-budget-a...

2. https://arxiv.org/abs/2205.11487

3. https://arxiv.org/abs/2301.00704

2 comments

ShamelessC 1060 days ago

> Seeing as you can throw out diffusion altogether and synthesize images with transformers [3]

That’s actually how this whole party got started. DALL-E (the first one) was a transformer model trained on image tokens from an early VAE (and text tokens ofc). Researchers from CompVis developed VQGAN in response. OpenAI showed improved fidelity with guided diffusion over ImageNet (classes) and subsequently DALL-E 2 using pixel-space diffusion and cascaded upsampling. CompVis responded with Latent Diffusion, which used diffusion in the latent space of some new VQGANs.

The paper you mention is interesting! They go back to the DALL-E 1 method but train two VQGANs for upsampling and increase the parameter count. This is faster, but only compared to the originally reported diffusion benchmarks, which used inferior sampling methods. I would be curious if they can beat some of the more recent samplers, which require as few as 10-20 steps.

They also improve on FID/CLIP scores, likely by using more parameters. This might be a memory/time trade-off though. I would be curious how much more VRAM their model requires compared to SD, MJ, Kandinsky.

The same goes for using T5-XXL. You’ll win FID score contests but no one will be able to run it without an A100 or TPU pod.

airgapstopgap 1060 days ago

> The same goes for using T5-XXL

Is this still true in 2023? Sure, back in the dark ages it seemed like an 860M model was just about the limit for a regular consumer, but I don't see why we wouldn't be able to use quantized encoders; and even 30B LLMs run okay on MacBooks now.
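
For a rough sense of scale, here is a back-of-the-envelope sketch (not taken from any of the linked papers) of how much memory the weights alone would need at different precisions. It assumes size is simply parameters times bits per parameter, and it ignores activations, quantization metadata, and runtime buffers. The function name `weight_memory_gib` and the list of bit widths are illustrative choices; the parameter counts are the ones quoted in this thread.

```python
# Back-of-the-envelope weight memory: params * bits / 8, reported in GiB.
# Weights only; activations and quantization overhead are ignored (assumption).

def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Approximate memory needed to hold the weights, in GiB."""
    return num_params * bits_per_param / 8 / 2**30

# Parameter counts as quoted in this thread.
MODELS = {
    "CLIP ViT-L/14 text encoder": 123e6,
    "860M model": 860e6,
    "T5-XXL encoder": 4.6e9,
    "30B LLM": 30e9,
}

if __name__ == "__main__":
    for name, params in MODELS.items():
        sizes = ", ".join(
            f"{bits}-bit: {weight_memory_gib(params, bits):.2f} GiB"
            for bits in (16, 8, 4)
        )
        print(f"{name}: {sizes}")
```

Under these assumptions the 4.6B encoder comes to roughly 8.6 GiB at 16-bit and about 2.1 GiB at 4-bit, so quantization is what would move it into consumer-GPU range.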

ShamelessC 1060 days ago

That’s a fair point and I’m not sure actually. I bet you’re right though.

Etherlord87 1060 days ago

> Images are far more redundant than text.

"A picture is worth a thousand words" - I wonder how (in)accurate this popular saying turned out to be? :D

elpocko 1060 days ago

I'm gonna go ahead and say in 2023, one detailed picture (512x512) is worth about 30 words.

SketchySeaBeast 1060 days ago

I guess that depends on the prompt.

k12sosse 1060 days ago

Do negative prompt tokens count as words?
