by SekstiNi 1148 days ago
They took down the blogpost, but from what I remember the model is composite and consists of a text encoder as well as 3 "stages":

1. (11B) T5-XXL text encoder [1]

2. (4.3B) Stage 1 UNet

3. (1.3B) Stage 2 upscaler (64x64 -> 256x256)

4. (?B) Stage 3 upscaler (256x256 -> 1024x1024)

Resolution numbers could be off though. Also, the third stage can apparently use either the existing Stable Diffusion x4 upscaler or a new upscaler that they aren't releasing yet (ever?).

> Once these are quantized (I assume they can be)

Based on the success of LLaMA 4-bit quantization, I believe the text encoder could be. As for the other modules, I'm not sure.

edit: the text encoder is 11B, not 4.5B as I initially wrote.

[1]: https://huggingface.co/google/t5-v1_1-xxl
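
For a rough sense of what quantization would buy, here is a back-of-the-envelope sketch using the parameter counts recalled in this comment. It counts weight storage only: activations, framework overhead, and the per-group scales that real quantization schemes store are ignored. Stage 3 is left out because its size is unknown. The helper name `weight_gb` is made up for illustration.

```python
# Parameter counts (in billions) as recalled in the comment above; they may be off.
PARAMS_BILLIONS = {
    "T5-XXL text encoder": 11.0,
    "Stage 1 UNet": 4.3,
    "Stage 2 upscaler": 1.3,
}

# Bits per weight for a few common storage formats.
BITS = {"fp16": 16, "int8": 8, "int4": 4}


def weight_gb(params_billions: float, bits: int) -> float:
    """Weight storage in decimal GB: (params * bits / 8) bytes, divided by 1e9."""
    return params_billions * bits / 8


for name, params in PARAMS_BILLIONS.items():
    sizes = ", ".join(
        f"{fmt}: {weight_gb(params, bits):.2f} GB" for fmt, bits in BITS.items()
    )
    print(f"{name} ({params}B): {sizes}")
```

By this estimate the full 11B T5-XXL goes from about 22 GB of weights at fp16 to about 5.5 GB at 4 bits, assuming 4-bit quantization works for it at all.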

2 comments

You'll be able to optimize it a lot to fit on small systems if you are willing to modify your workflow a bit: instead of 1 prompt -> 1 image _n_ times, do 1 prompt -> _n_ images 1 time -> _m_ times... For a given prompt, run it through the T5 model once and store the embedding; you can do that in CPU RAM if you have to, because you only need the embedding once, so you don't need a GPU which can run T5-XXL naively. Then you can get a large batch of samples from #2; 64px is enough to preview. Only once you pick some do you run them through #3, and then from those through #4. Your peak VRAM should be 1 image in #2 or #4, and that can be quantized or pruned down to something that will fit on many GPUs.
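
The workflow above can be sketched roughly as follows. Every function here (`encode_prompt`, `stage1_sample`, `upscale`, `generate`) is a hypothetical placeholder that only mimics the control flow; none of them is the model's actual API. The resolutions are the ones quoted upthread and may be off.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def encode_prompt(prompt: str) -> tuple:
    """Placeholder for the T5 pass: runs once per prompt, result is cached.

    In the real workflow this could run on CPU, since it happens only once.
    """
    return tuple(float(ord(c)) for c in prompt)  # fake embedding


def stage1_sample(embedding: tuple, n: int) -> list:
    """Placeholder for the base model (#2): n cheap 64px previews."""
    return [{"id": i, "size": 64} for i in range(n)]


def upscale(image: dict, size: int) -> dict:
    """Placeholder for an upscaler (#3 or #4): handles one image at a time."""
    return {**image, "size": size}


def generate(prompt: str, n: int, pick: list) -> list:
    embedding = encode_prompt(prompt)          # 1 prompt -> 1 cached embedding
    previews = stage1_sample(embedding, n)     # 1 embedding -> n previews
    chosen = [previews[i] for i in pick]       # keep only the previews you like
    mid = [upscale(img, 256) for img in chosen]    # #3, one image at a time
    return [upscale(img, 1024) for img in mid]     # #4, one image at a time


finals = generate("a red fox in snow", n=16, pick=[2, 7])
print(finals)
```

The point of the structure is that the expensive upscalers only ever see the few images that survive selection, and the text encoder never has to share GPU memory with them.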
The entire T5-XXL model is 11B but you don't need the decoder.