Hacker News
by wokwokwok 1349 days ago
> But, as we take a bunch of small steps and gradually move back through the diffusion process...

...but the question is: why can't we take a big step and be at the end in one step?

Obviously a series of small steps gets you there, but the question was why you need to take small steps.

I feel like this is just an 'intuitive explanation' that doesn't actually do anything other than rephrase the question: "You take a series of small steps to reduce the noise in each step and end up with a picture with no noise".

The real reason is that big steps give worse results (1); the model was specifically designed to be a series of small steps because when you take big steps, you end up with overfitting, where the model just generates a few outputs from any input.

(1) - https://arxiv.org/pdf/1503.03585.pdf

2 comments

The reason big steps produce worse results, when using current architectures and loss functions, is precisely that the least-squares prediction error and simple "predict the mean" approach used to train the inverse model do not give it enough representational capacity to capture the almost always multimodal conditional distribution p(clean image | noisy image at step t) that it attempts to approximate.

Essentially, current approaches rely strongly on an assumption that the conditional we want to estimate in each step of the reverse diffusion process is approximately an isotropic Gaussian distribution. This assumption breaks down as you increase the size of the steps, and models which rely on the assumption also break down.
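To make the "predict the mean" point concrete, here is a minimal sketch in plain Python. The two-mode toy data is an assumption standing in for a multimodal p(clean image | noisy image), and the helper names are made up for illustration. It shows that the prediction which minimizes squared error sits between the modes, in a region where there are essentially no real samples.

```python
import random

def best_constant_under_mse(samples):
    """The constant c that minimizes mean((x - c)^2) is the sample mean."""
    return sum(samples) / len(samples)

def mse(samples, c):
    """Mean squared error of predicting the constant c for every sample."""
    return sum((x - c) ** 2 for x in samples) / len(samples)

random.seed(0)
# Hypothetical toy conditional: two narrow modes, one near -1 and one near +1.
samples = [random.choice([-1.0, 1.0]) + random.gauss(0.0, 0.05)
           for _ in range(10_000)]

c_star = best_constant_under_mse(samples)
print("MSE-optimal prediction:", round(c_star, 3))
print("MSE when predicting the mean:", round(mse(samples, c_star), 3))
print("MSE when predicting the mode at +1:", round(mse(samples, 1.0), 3))
print("samples within 0.5 of the mean:",
      sum(abs(x - c_star) < 0.5 for x in samples))
```

The mean wins on squared error even though it looks like neither mode, which is roughly the failure described above when a single step has to cover a multimodal conditional.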

This is not directly related to overfitting. It is a fundamental aspect of how these models are designed and trained. If the architecture and loss function for training the inverse model were changed, it would be possible to make an inverse model that inverts more steps of the forward diffusion process in a single go, but then the inverse model would need to become a full generative model on its own.

> This assumption breaks down as you increase the size of the steps, and models which rely on the assumption also break down.

Hm. Why's that?

The only reason I mentioned overfitting is because that's literally what they say in the paper I linked, that the diffusion factor was selected to prevent overfitting.

...

I guess I don't really have a deep understanding of this stuff, but your explanation seems to be missing something: specifically, that noise is added to the latent each round, on a schedule (1), with less noise each round.

That's what causes it to converge on a 'final' value; you're explicitly modifying the amount of additional noise you feed in. If you don't add any noise, you get nothing more from doing 1 step than you do from 10 or 50.

Right?

"as we take a bunch of small steps and gradually move back through the diffusion process, the effective distribution of real images over which this inverse diffusion prediction averages has lower and lower entropy"

I'm really not sure about that... :/

(1) - "For binomial diffusion, the discrete state space makes gradient ascent with frozen noise impossible. We instead choose the forward diffusion schedule β_1···T to erase a constant fraction 1/T of the original signal per diffusion step, yielding a diffusion rate of β_t = (T − t + 1)^(−1)."
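The quoted schedule can be checked with a few lines of arithmetic. This is only a sketch: it assumes each forward step keeps a fraction (1 − β_t) of whatever signal remained, and the function names are made up for illustration. Under that assumption, β_t = 1/(T − t + 1) removes exactly 1/T of the original signal at every step.

```python
from fractions import Fraction

def binomial_schedule(T):
    """beta_t = 1 / (T - t + 1) for t = 1..T, as in the quoted passage."""
    return [Fraction(1, T - t + 1) for t in range(1, T + 1)]

def surviving_signal(betas):
    """Fraction of the original signal left after each step.

    Assumption: each step keeps (1 - beta_t) of what remained before it.
    """
    remaining = Fraction(1)
    history = []
    for beta in betas:
        remaining *= 1 - beta
        history.append(remaining)
    return history

T = 10
betas = binomial_schedule(T)
remaining = surviving_signal(betas)
# Signal erased at step t = (signal before step t) - (signal after step t).
erased_per_step = [before - after
                   for before, after in zip([Fraction(1)] + remaining, remaining)]

for t, (beta, erased) in enumerate(zip(betas, erased_per_step), start=1):
    print(f"t={t}: beta_t={beta}, erased this step={erased}")
```

Exact fractions are used so the constant 1/T shows up without floating-point noise.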

To train the inverse diffusion model, we take a clean image x0 and generate a noisy sample xt which is from the distribution over points that x0 would visit following t steps of forward diffusion. For any value of t, any xt which is visited by x0 is also visited by some other clean images x0' when we run t steps of diffusion starting from those x0'. In general, there will be many such x0' for any xt which our initial x0 might visit after t steps of forward diffusion.

If t is small and the noise schedule for diffusion adds small noise at each step, then the inverse conditional p(x0 | xt) which we want to learn will be approximately a unimodal Gaussian. This is an intrinsic property of the forward diffusion process. When t is large, or the diffusion schedule adds a lot of noise at each step, the conditional p(x0 | xt) will be more complex and include a larger fraction of the images in the training set.
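A toy version of this can be worked out exactly. The sketch below assumes a "training set" of just two clean values, x0 = −1 and x0 = +1, equally likely, and a simplified forward process xt = x0 + sigma * noise with no rescaling. Both are assumptions for illustration, not the setup of any particular model. Bayes' rule then gives p(x0 | xt) in closed form.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def posterior_plus(xt, sigma):
    """p(x0 = +1 | xt) for x0 in {-1, +1} with equal prior probability,
    where xt = x0 + sigma * (standard Gaussian noise)."""
    # log N(xt; +1, sigma^2) - log N(xt; -1, sigma^2) = 2 * xt / sigma^2
    return sigmoid(2.0 * xt / sigma ** 2)

xt = 0.3
for sigma in (0.2, 1.0, 3.0):
    p = posterior_plus(xt, sigma)
    print(f"sigma={sigma}: p(x0=+1 | xt)={p:.4f}, p(x0=-1 | xt)={1 - p:.4f}")
```

With little noise, almost all of the posterior mass sits on one clean value, so a single-mode approximation is harmless. With a lot of noise, both clean values stay plausible and the conditional is genuinely two-peaked.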

"If you don't add any noise, you get nothing more from doing 1 step than you do from 10 or 50." -- there are actually models which deterministically (approximately) integrate the reverse diffusion process SDE and don't involve any random sampling aside from the initial xT during generation.

For example, if t=T, where T is the total length of the diffusion process, then xt=xT is effectively an independent Gaussian sample and the inverse conditional p(x0 | xT) is simply p(x0), which is the distribution of the training data. In general, p(x0) is not a unimodal isotropic Gaussian. If it were, we could just model our training set by fitting the mean and (diagonal) covariance matrix of a Gaussian distribution.

"I'm really not sure about that... :/" -- the forward diffusion process initiated from x0 iteratively removes information about the starting point x0. Depending on the noise schedule, the rate at which information about x0 is removed by the addition of noise can vary. Whether we're in the continuous or discrete setting, this means the inverse conditional p(x0 | xt) will increase in entropy as t goes from 1 to T, where T is the max number of diffusion steps. So, when we generate an image by running the inverse diffusion process the conditional p(x0 | xt) will have shrinking entropy as t is now decreasing.

The quote (1) you reference is about how trying to directly optimize the noise schedule is more challenging when working with discrete inputs/latents. Whether the noise schedule is trained, as in their continuous case, or defined a priori, as in their discrete case, each step of forward diffusion removes information about the input, and what I said about shrinking entropy of the conditional p(x0 | xt) as we run reverse diffusion holds. In the case of current SOTA diffusion models, I believe the noise schedules are set via hyperparameter search rather than optimized by SGD/Adam/etc.
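The entropy claim can be checked numerically on a toy case. This sketch assumes two equally likely clean values, x0 = −1 and x0 = +1, and a simplified forward process xt = x0 + sigma * noise. Larger sigma plays the role of larger t, and all names are illustrative. It averages the entropy of p(x0 | xt) over the xt values that noising would produce.

```python
import math

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def binary_entropy(p):
    """Entropy in bits of a two-outcome distribution (p, 1 - p)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_posterior_entropy(sigma, n=4001):
    """Average entropy of p(x0 | xt), integrating over xt ~ N(+1, sigma^2).

    By symmetry, conditioning on x0 = +1 gives the same average as x0 = -1.
    Uses a plain Riemann sum over +/- 8 standard deviations.
    """
    lo, hi = 1.0 - 8.0 * sigma, 1.0 + 8.0 * sigma
    dx = (hi - lo) / (n - 1)
    total = 0.0
    for i in range(n):
        xt = lo + i * dx
        density = math.exp(-0.5 * ((xt - 1.0) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))
        p_plus = sigmoid(2.0 * xt / sigma ** 2)  # p(x0 = +1 | xt)
        total += density * binary_entropy(p_plus) * dx
    return total

sigmas = [0.25, 0.5, 1.0, 2.0, 4.0]
entropies = [expected_posterior_entropy(s) for s in sigmas]
for s, h in zip(sigmas, entropies):
    print(f"sigma={s}: average entropy of p(x0 | xt) = {h:.4f} bits")
```

Read top to bottom, this is the forward process: entropy rises toward the full 1 bit of the prior. Read bottom to top, it is generation: the conditional narrows as the noise level drops.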

> why can't we take a big step and be at the end in one step.

Because we're doing gradient descent. (No, seriously, it's turtles all the way down (or all the way up, considering we're at a higher level of abstraction here).)

We're trying to (quickly, in less than 100 steps) descend a gradient through a complex, irregular and heavily foggy 16384-dimensional landscape of smeared, distorted, and white-noise-covered images that kinda sorta look vaguely like what we want if you squint (well, if the neural network squints, anyway). If we try to take a big step, we don't descend the gradient faster; we fly off in a mostly random direction, clip through various proverbial cliffs, and probably end up somewhere higher up the gradient than we started.
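The step-size half of that analogy is easy to see on a deliberately boring landscape. This is a toy, assuming plain gradient descent on f(x) = x^2, and says nothing about an actual diffusion sampler: small steps walk down to the minimum, while a step that is too big overshoots and ends up higher than it started.

```python
def gradient_descent(step_size, steps=20, x=1.0):
    """Minimize f(x) = x^2 (gradient 2x), starting from x."""
    for _ in range(steps):
        x -= step_size * 2 * x
    return x

small = gradient_descent(step_size=0.1)
big = gradient_descent(step_size=1.1)
print("small steps end at x =", small, "with f(x) =", small ** 2)
print("big steps end at x =", big, "with f(x) =", big ** 2)
```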