
Nothing to pardon, asking questions is always the right thing to do :-) I also didn't look into the paper in great detail, so although I'm fairly sure I'm not fooling myself, take this with a grain of salt.

My understanding is that this paper by MIT doesn't train any new model from scratch. It takes a pretrained model (e.g. StableDiffusion), which is trained to take only "a small step": you fix a number of steps (e.g. 1000 in the MIT paper) and ask the model to predict how to "enhance" an image by one step (e.g. of size 1/1000); the constants are adjusted so that, if the model is "perfect", you get from pure white noise to an image in exactly the number of steps you set. If I remember correctly how diffusion works, in theory you could set this number to any value, including 1, but in practice you need several hundred steps to get a good result, i.e. the original StableDiffusion model is only able to fit a small adjustment.
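
To make the "small step" idea concrete, here is a toy sketch in Python. It is not diffusion and has nothing to do with the actual StableDiffusion code: the "model" is a made-up one-dimensional update rule (`small_step`, `TARGET`, and `RATE` are all my own illustrative names), chosen only because each step is accurate when the step size is small and badly wrong when you try to do everything in one step.

```python
import math

TARGET = 2.0  # hypothetical stand-in for "the image" the sampler heads toward
RATE = 4.0    # hypothetical strength of each small step

def small_step(x, dt):
    """Toy "model": one small update, only accurate when dt is small."""
    return x + dt * RATE * (TARGET - x)

def sample(noise, steps):
    """Go from noise to a result in exactly `steps` equal steps of size 1/steps."""
    x = noise
    dt = 1.0 / steps
    for _ in range(steps):
        x = small_step(x, dt)
    return x

def exact(noise):
    """What `sample` converges to as steps grows (closed form for this toy)."""
    return TARGET + (noise - TARGET) * math.exp(-RATE)

noise = -1.0
for steps in (1, 10, 1000):
    # Print the step count, the sampled result, and the many-step limit.
    print(steps, sample(noise, steps), exact(noise))
```

With 1000 steps the result is very close to the limit, while a single step overshoots wildly, which is the toy analogue of "in practice you need several hundred steps".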

This new paper shows how to "distil" the original model (in this case, StableDiffusion) into another model. However, unlike typical distillation, which is used to compress a big model into a smaller one, in this case the distilled model is basically the same as the one you start with; it has just been trained with a different objective, namely to transform random noise directly into the prediction that the original model (StableDiffusion) would make in 1000 steps. To do so, it is trained on a very large number of triples (text, noise, image). But I don't think you can incorporate into this training procedure other "real" images that are not generated by the model you start with, because you don't have a corresponding noise for them (abstractly, there is no such concept as the "corresponding noise" of a given image, because the relation noise -> image depends on the specific model you start with, and this map is not anywhere near invertible, since not all images can be generated by StableDiffusion, or any other model).
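
Here is a minimal Python sketch of that training setup as I understand it, again on a toy problem and not the paper's actual method or loss. The `teacher` is a made-up 1000-step sampler, the "text" is just a number, and the `student` is a hypothetical one-step linear model fitted by plain gradient descent on squared error. The only point is the shape of the data: every (text, noise, image) triple has its image produced by the teacher.

```python
import random

STEPS = 1000  # number of small steps the toy teacher takes

def teacher(prompt, noise, steps=STEPS):
    """Toy multi-step teacher: many small updates pulling x toward `prompt`."""
    x = noise
    dt = 1.0 / steps
    for _ in range(steps):
        x = x + dt * 4.0 * (prompt - x)
    return x

def make_triples(n, seed=0):
    """Build (text, noise, image) triples; the image always comes from the teacher."""
    rng = random.Random(seed)
    triples = []
    for _ in range(n):
        prompt = rng.uniform(-1.0, 1.0)  # stands in for the text prompt
        noise = rng.gauss(0.0, 1.0)
        triples.append((prompt, noise, teacher(prompt, noise)))
    return triples

def student(params, prompt, noise):
    """One-step student: a single linear map from (prompt, noise) to image."""
    a, b, c = params
    return a * prompt + b * noise + c

def train_student(triples, epochs=500, lr=0.1):
    """Fit the student to the teacher's outputs by full-batch gradient descent."""
    a = b = c = 0.0
    n = len(triples)
    for _ in range(epochs):
        ga = gb = gc = 0.0
        for prompt, noise, image in triples:
            err = student((a, b, c), prompt, noise) - image
            ga += 2 * err * prompt / n
            gb += 2 * err * noise / n
            gc += 2 * err / n
        a, b, c = a - lr * ga, b - lr * gb, c - lr * gc
    return a, b, c

params = train_student(make_triples(200))
# Compare one student step against 1000 teacher steps on an unseen input.
print(student(params, 0.5, -0.3), teacher(0.5, -0.3))
```

Note that `make_triples` has no way to accept an outside "real" image: without the teacher there is no noise to pair it with, which is the limitation described above.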

Once the model is trained, you can of course give it a new prompt and, in theory, it should generate something rather similar to what StableDiffusion would generate with the same prompt (hopefully, the examples displayed on their web page are not from the training set! Otherwise it would be totally useless). But you should never obtain something "totally different" from what StableDiffusion would give you, so in that sense it's not "general", it is "just" a model that imitates StableDiffusion very well while being much faster. Which is already great of course :-)