by psb217
1356 days ago
The reason big steps produce worse results with current architectures and loss functions is that the least-squares prediction error and simple "predict the mean" approach used to train the inverse model cannot capture the almost always multimodal conditional distribution p(clean image | noisy image at step t) that the inverse model is trying to approximate. Essentially, current approaches rely strongly on the assumption that the conditional we want to estimate at each step of the reverse diffusion process is approximately an isotropic Gaussian. That assumption breaks down as you increase the step size, and models that rely on it break down with it. This is not directly related to overfitting; it is a fundamental aspect of how these models are designed and trained. If the architecture and loss function for training the inverse model were changed, it would be possible to build an inverse model that inverts more steps of the forward diffusion process in a single go, but then the inverse model would need to become a full generative model in its own right.
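To make the "predict the mean" point concrete, here is a toy sketch (my own construction, not from any paper): the "clean image" is a single number that is either -1 or +1, and the noisy version is x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise. The names `posterior_mean`, `mean_abs_prediction` and `alpha_bar` are just illustrative. For this setup the least-squares-optimal prediction is E[x0 | x_t], which has a closed form, so no training is needed.

```python
import math
import random

def posterior_mean(x_t, alpha_bar):
    """Least-squares-optimal guess E[x0 | x_t] when x0 is -1 or +1 with
    equal probability and
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * standard normal noise."""
    return math.tanh(math.sqrt(alpha_bar) * x_t / (1.0 - alpha_bar))

def mean_abs_prediction(alpha_bar, n=20000, seed=0):
    """Average |E[x0 | x_t]| over simulated noisy samples.
    A value near 1 means the prediction lands on a real mode (-1 or +1);
    a value near 0 means it lands between the modes, which is not a valid x0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x0 = rng.choice((-1.0, 1.0))
        noise = rng.gauss(0.0, 1.0)
        x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * noise
        total += abs(posterior_mean(x_t, alpha_bar))
    return total / n

# Small alpha_bar = lots of noise = trying to jump a long way back in one step.
for alpha_bar in (0.05, 0.5, 0.95):
    print(alpha_bar, round(mean_abs_prediction(alpha_bar), 3))
```

With heavy noise the conditional is bimodal and the mean prediction sits between the two modes; with light noise the conditional is nearly unimodal and the mean is a sensible answer. That is the one-dimensional version of why one big step gives blurry results.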
Hm. Why's that?
The only reason I mentioned overfitting is that it's literally what they say in the paper I linked: the diffusion factor was selected to prevent overfitting.
...
I guess I don't really have a deep understanding of this stuff, but your explanation seems to be missing something: specifically, that noise is added to the latent each round, on a schedule (1), with less noise each round.
That's what causes it to converge on a 'final' value; you're explicitly modifying the amount of additional noise you feed in. If you don't add any noise, you get nothing more from doing 1 step than you do from 10 or 50.
Right?
"as we take a bunch of small steps and gradually move back through the diffusion process, the effective distribution of real images over which this inverse diffusion prediction averages has lower and lower entropy"
I'm really not sure about that... :/
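One way to poke at the entropy claim is a toy calculation. This is only a sketch under my own assumptions: the "dataset" is four equally likely numbers, the noising is x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * noise, and `posterior_entropy` and `average_entropy` are made-up helper names. It measures how spread out p(x0 | x_t) is at different noise levels.

```python
import math
import random

def posterior_entropy(x_t, data, alpha_bar):
    """Entropy (in nats) of p(x0 | x_t) over a finite set of equally likely
    1-D 'images', under Gaussian noising with signal level alpha_bar."""
    a, var = math.sqrt(alpha_bar), 1.0 - alpha_bar
    logits = [-(x_t - a * x0) ** 2 / (2.0 * var) for x0 in data]
    m = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - m) for l in logits]
    z = sum(weights)
    probs = [w / z for w in weights]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def average_entropy(data, alpha_bar, n=2000, seed=0):
    """Monte Carlo average of the posterior entropy over simulated x_t."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x0 = rng.choice(data)
        noise = rng.gauss(0.0, 1.0)
        x_t = math.sqrt(alpha_bar) * x0 + math.sqrt(1.0 - alpha_bar) * noise
        total += posterior_entropy(x_t, data, alpha_bar)
    return total / n

data = [-3.0, -1.0, 1.0, 3.0]
# alpha_bar near 0: almost pure noise. alpha_bar near 1: almost clean.
for alpha_bar in (0.01, 0.5, 0.99):
    print(alpha_bar, round(average_entropy(data, alpha_bar), 3))
```

In this toy setup the average entropy falls as alpha_bar rises, i.e. as you move back toward the clean end of the process, fewer candidate "images" remain plausible for a given noisy input.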
(1) - "For binomial diffusion, the discrete state space makes gradient ascent with frozen noise impossible. We instead choose the forward diffusion schedule β_1···T to erase a constant fraction 1/T of the original signal per diffusion step, yielding a diffusion rate of β_t = (T − t + 1)^(−1)."
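The quoted schedule can be checked with a few lines of arithmetic. This sketch assumes each forward step scales whatever signal remains by (1 − β_t), which is my reading of "erase a constant fraction"; `beta` and `surviving_signal` are illustrative names, not from the paper.

```python
from fractions import Fraction

def beta(t, T):
    """Diffusion rate from the quote: beta_t = 1 / (T - t + 1), for t = 1..T."""
    return Fraction(1, T - t + 1)

def surviving_signal(t, T):
    """Fraction of the original signal left after t forward steps,
    assuming step s multiplies the remaining signal by (1 - beta_s)."""
    remaining = Fraction(1)
    for s in range(1, t + 1):
        remaining *= 1 - beta(s, T)
    return remaining

T = 10
for t in range(T + 1):
    print(t, surviving_signal(t, T))
```

Under that assumption the product telescopes to (T − t)/T, so every step removes exactly 1/T of the original signal and nothing is left after step T. Note that β_t itself grows with t in the forward direction, even though the amount of original signal erased per step is constant.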