Like anyone deep in a field, I know maybe several thousand people who could probably give a better answer, but I figured I'd make an effort to provide one since I don't see any good ones posted yet.

The moment everyone in the AI field knew this was going to be big was in 2019, when StyleGAN came out. They used a lot of tricks, like aligning facial features (such as eyes), and each model was trained on pictures from a single domain (the most famous being faces). Nonetheless, that was the turning point, and so three years ago a lot of big people shifted to this line of research.

The four main innovations since then have been:

1. Transformers
General-purpose computation kernels that let a model consider non-local relationships between the pixels of an image. Released in 2017, and originally used for language.

2. Pixel Patch Encodings
Encodings of semantic and geometric image information at different resolutions, which represent relationships between image areas better than raw pixels can for the same compute. They allow using Transformers on high-resolution images.

3. CLIP
Contrastive Language-Image Pre-training. Before, the only way we knew to classify an image was as a "face" or "cat" or "ramen". Once the genius idea of labeling images with semantically meaningful vectors rather than one-hot encoded classes was revealed, it changed everything in computer vision very quickly, and problems that used to be hard became trivial. Released in 2021.

4. Diffusion Models
GANs penalise you for making an image which does not seem to be part of an existing dataset. This encourages the generator to make the worst-quality image that still looks like a member of that dataset. Diffusion models learn to denoise an image; removing noise is perceptually similar to increasing resolution, and people like images that look that way. People with better intuition about diffusion models may be able to add more on why they're superior. I've read all the papers leading up to the latest unCLIP (DALL-E 2), but it's complicated. Released in their current form in 2020, with major improvements to the training process continuously being made since then.

Hope this was helpful.

All of the above were only implemented for images in any real way in the last three years. Putting them all together is something many people only did this year, resulting in DALL-E, Stable Diffusion, and Imagen.

I'm working on doing this for 3D, and later for use cases in AR. 3D generation still hasn't been cracked the same way image generation has, but the above will likely contribute to the solution there as well. Anyone who's interested in working on that, feel free to message me.
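
To make the CLIP point above a bit more concrete, here is a minimal sketch of why vector labels are more flexible than one-hot classes. It only illustrates the inference side (scoring an image against arbitrary captions by cosine similarity), not the contrastive training itself. The 3-dimensional embeddings and the names `TEXT_EMBEDDINGS` and `rank_captions` are made up for illustration; a real model would produce much larger vectors from learned image and text encoders.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def softmax(scores, temperature=1.0):
    """Turn similarity scores into probabilities (numerically stable)."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical text embeddings (hand-picked toy values, not real model output).
TEXT_EMBEDDINGS = {
    "a photo of a cat": [0.9, 0.1, 0.0],
    "a photo of a kitten": [0.8, 0.3, 0.1],
    "a bowl of ramen": [0.0, 0.2, 0.9],
}


def rank_captions(image_embedding, text_embeddings, temperature=0.1):
    """Rank captions by similarity to the image, most likely first."""
    captions = list(text_embeddings)
    sims = [cosine(image_embedding, text_embeddings[c]) for c in captions]
    probs = softmax(sims, temperature)
    return sorted(zip(captions, probs), key=lambda pair: pair[1], reverse=True)


# Hypothetical image embedding for a cat picture (again, a toy value).
image_embedding = [0.85, 0.2, 0.05]
ranking = rank_captions(image_embedding, TEXT_EMBEDDINGS)
for caption, prob in ranking:
    print(f"{prob:.3f}  {caption}")
```

The thing to notice is that "cat" and "kitten" both score well while "ramen" scores badly, and that adding a new caption needs no retraining. With one-hot classes, "cat" and "kitten" would be exactly as unrelated as "cat" and "ramen".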
The models behind Imagen and Stable Diffusion are actually simpler than DALL-E 2, and both are higher quality (SD of course isn’t always, since it’s much smaller). That suggests DALL-E 3 will be simpler again.
There’s also been very recent work on generalized diffusion models (which learn to undo degradations other than added noise and still work), and Google researchers have been tweeting results from a merged Imagen/Parti in the last few days.
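
For anyone wondering what "other than noise removal" could look like, here is a rough sketch of the training setup, on a 1-D list of floats instead of an image. It is my own toy illustration, not the method from any of the work mentioned: the function names, the linear noise schedule, and the box blur are all assumptions. The point is only that the model is trained to undo a degradation of known severity `t`, and the degradation function can be swapped.

```python
import math
import random


def add_noise(signal, t, num_steps, rng):
    """Noise-style degradation: blend the signal toward Gaussian noise.

    Uses a simplistic linear schedule (an assumption for this sketch).
    """
    keep = 1.0 - t / num_steps  # fraction of the signal's variance retained
    return [
        math.sqrt(keep) * x + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
        for x in signal
    ]


def blur(signal, t, num_steps, rng=None):
    """Deterministic degradation: apply a 3-tap box blur t times."""
    out = list(signal)
    for _ in range(t):
        padded = [out[0]] + out + [out[-1]]  # replicate the edges
        out = [
            (padded[i] + padded[i + 1] + padded[i + 2]) / 3.0
            for i in range(len(out))
        ]
    return out


def make_training_pair(signal, degrade, num_steps, rng):
    """Pick a random severity t and return (degraded, clean, t)."""
    t = rng.randint(1, num_steps)
    return degrade(signal, t, num_steps, rng), list(signal), t


def restoration_loss(predict, degraded, clean, t):
    """Mean squared error between the model's restoration and the clean signal."""
    restored = predict(degraded, t)
    return sum((r - c) ** 2 for r, c in zip(restored, clean)) / len(clean)


rng = random.Random(0)
clean_signal = [0.0] * 4 + [1.0] * 4  # a sharp edge
for degrade in (add_noise, blur):
    degraded, clean, t = make_training_pair(clean_signal, degrade, 10, rng)
    # An untrained "model" that returns its input unchanged, as a baseline.
    baseline = restoration_loss(lambda d, step: d, degraded, clean, t)
    print(degrade.__name__, "t =", t, "baseline loss =", round(baseline, 4))
```

A real system would replace the identity baseline with a neural network that takes `(degraded, t)` and is trained to minimise this loss, then generates by repeatedly restoring from a heavily degraded start.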