|
|
|
|
|
by stefanbaumann
873 days ago
|
|
Thanks a lot! Yeah, the main motivation was trying to find a way to enable transformers to do high-resolution image synthesis: transformers are known to scale well to extreme, multi-billion parameter scales and typically offer superior coherency & composition in image generation, but current architectures are too expensive to train at scale for high-resolution inputs. By using a hierarchical architecture and local attention at high-resolution scales (but retaining global attention at low-resolution scales), it becomes viable to apply transformers at these scales.
Additionally, this architecture can now directly be trained on megapixel-scale inputs and generate high-quality results without having to progressively grow the resolution over the training or applying other "tricks" typically needed to make models at these resolutions work well. |
|