|
|
|
|
|
by spi
812 days ago
|
|
The weights are different, because the model is different. As jzbontar below mentions, the crucial point is that the random noise mask is the same. The diffusion models are trained to turn random noise to an image, and they are deterministic at that - the same noise leads to the same image. What the authors did here was to find a smart way of training a new model able to "simulate" in a single step what diffusion achieves in many; to do so, they took many triplets of (prompt, noise, image) generated starting from random noise and a (fixed) pretrained stable diffusion checkpoint. The model is trained to replicate the results. So, it is surprising that this works at all at creating meaningful images, but it would be _really_ surprising (i.e. probably impossible) if it generated meaningful images which were seriously different from the ones it was pretrained with! |
|
Pardon my ignorance ...
Does MIT model then not work as a general text-to-image model to generate novel images based on arbitrary new text prompts that it has not seen before?