Hacker News
by minimaxir 670 days ago
The diffusion process is conditioned on CLIP text embeddings, which works better (in theory) because CLIP is trained to place a caption's text embedding close to the embedding of its matching image.
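
To make "conditioned on" a bit more concrete, here is a toy sketch of one common way to feed text embeddings into a denoiser: cross-attention, where each image latent attends over the text token embeddings. This is an illustration under assumptions, not the actual model's code: the vectors are random stand-ins for CLIP output, the function names (`cross_attend`, `denoise_step`) are made up, and the learned query/key/value projections and trained noise predictor of a real network are left out.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(latents, text_tokens):
    """Scaled dot-product attention: latents are queries, text tokens are keys and values.

    Simplification: no learned projection matrices, so queries, keys and
    values are the raw vectors themselves.
    """
    dim = len(text_tokens[0])
    contexts = []
    for q in latents:
        weights = softmax([dot(q, k) / math.sqrt(dim) for k in text_tokens])
        context = [
            sum(w * tok[i] for w, tok in zip(weights, text_tokens))
            for i in range(dim)
        ]
        contexts.append(context)
    return contexts


def denoise_step(latents, text_tokens, step_size=0.1):
    """Toy update (hypothetical): nudge each latent toward its text-derived context."""
    contexts = cross_attend(latents, text_tokens)
    return [
        [x + step_size * (c - x) for x, c in zip(latent, ctx)]
        for latent, ctx in zip(latents, contexts)
    ]


random.seed(0)
# Stand-ins: 3 "text token" embeddings and 2 noisy "image latents", all 4-dimensional.
text_tokens = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]
latents = [[random.gauss(0, 1) for _ in range(4)] for _ in range(2)]

updated = denoise_step(latents, text_tokens)
```

The point of the sketch is only that the text enters every denoising step through the attention weights, so a different prompt embedding steers the latents somewhere else.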