Hacker News new | ask | show | jobs
by stefanbaumann 873 days ago
Thanks a lot!

Yeah, the main motivation was trying to find a way to enable transformers to do high-resolution image synthesis: transformers are known to scale well to extreme, multi-billion parameter scales and typically offer superior coherency & composition in image generation, but current architectures are too expensive to train at scale for high-resolution inputs.

By using a hierarchical architecture and local attention at high-resolution scales (but retaining global attention at low-resolution scales), it becomes viable to apply transformers at these scales. Additionally, this architecture can now directly be trained on megapixel-scale inputs and generate high-quality results without having to progressively grow the resolution over the training or applying other "tricks" typically needed to make models at these resolutions work well.