| They took down the blogpost, but from what I remember the model is composite and consists of a text encoder as well as 3 "stages": 1. (11B) T5-XXL text encoder [1] 2. (4.3B) Stage 1 UNet 3. (1.3B) Stage 2 upscaler (64x64 -> 256x256) 4. (?B) Stage 3 upscaler (256x256 -> 1024x1024) Resolution numbers could be off though. Also the third stage can apparently use the existing stable diffusion x4, or a new upscaler that they aren't releasing yet (ever?). > Once these are quantized (I assume they can be) Based on the success of LLaMA 4bit quantization, I believe the text encoder could be. As for the other modules, I'm not sure. edit: the text encoder is 11B, not 4.5B as I initially wrote. [1]: https://huggingface.co/google/t5-v1_1-xxl |