by programjames 462 days ago
Anyone willing to give an intuitive summary of what they did mathwise? The math in the paper is super ugly to churn through.

Last author here (I also did the DDIM paper, https://arxiv.org/abs/2010.02502). I know this is going to be very tricky math-wise (and in the paper we just wrote the most general thing to make reviewers happy), so I tried to explain the idea more simply in the blog post (https://lumalabs.ai/news/inductive-moment-matching).

If you look at how a single step of the DDIM sampler interacts with the target timestep, it is actually just a linear function. This is obviously quite inflexible if we want to use it to represent a flexible function where we can choose any target timestep. So just add this as an argument to the neural network and then train it with a moment matching objective.
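
To make the linearity point concrete, here is a minimal Python sketch (not code from the paper). It assumes the linear schedule alpha_t = 1 - t, sigma_t = t, and `x0_hat` is a hypothetical stand-in for the network's clean-data prediction, which depends on (x_t, t) but not on the target time s.

```python
def ddim_step(x_t, x0_hat, t, s):
    """One deterministic DDIM-style step from time t to an earlier time s.

    Assumption: linear schedule alpha_t = 1 - t, sigma_t = t, so that
    x_t = (1 - t) * x0 + t * noise. x0_hat is a placeholder for the
    network output; it is computed from (x_t, t) only, never from s.
    """
    alpha_t, sigma_t = 1.0 - t, t
    alpha_s, sigma_s = 1.0 - s, s
    # Coefficients of the DDIM interpolant; only these depend on s.
    coef_x0 = alpha_s - (sigma_s / sigma_t) * alpha_t
    coef_xt = sigma_s / sigma_t
    return [coef_x0 * a + coef_xt * b for a, b in zip(x0_hat, x_t)]


if __name__ == "__main__":
    x_t, x0_hat, t = [0.8, -0.3], [0.1, 0.5], 0.9
    for s in (0.0, 0.3, 0.6, 0.9):
        print(s, ddim_step(x_t, x0_hat, t, s))
```

Under this assumed schedule the update reduces to x_s = (1 - s/t) * x0_hat + (s/t) * x_t, so once x_t and t are fixed, the output can only move along a straight line as s varies.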

In general, I feel that analyzing a method's inference-time properties before training it can be helpful to not only diffusion models, but also LLMs including various recent diffusion LLMs, which prompted me to write a position paper in the hopes that others develop cool new ideas (https://arxiv.org/abs/2503.07154).

Just as a counter perspective: I think your paper is great!

Please don’t let people ever discourage you from writing proper papers. Ever since Meta etc. started asking for "2 papers in relevant fields", we have seen a flood of papers that should be tweets.

What happens if we don't add any moment matching objective? E.g., at train time, just fit a diffusion model that predicts the target given any pair of timesteps (t, t')? Why is moment matching critical here?

Also, regarding linearity: why is it inflexible? It seems quite convenient that a simple linear interpolation is used for reconstruction. Besides, even in DDIM, the direction towards the final target changes at each step as the images become less noisy. In standard diffusion models, or even flow matching, the denoising step is always the prediction of the original data plus a direction from the current timestep to the timestep t'. Just to be clear, it is intuitive that such models are inferior in few-step generation, since they don't optimise for test-time efficiency (in terms of the tradeoff of quality vs. compute), but it's unclear what inflexibility exists there beyond this limitation.

Clearly there's no expected benefit in quality if all timesteps are used in denoising?

Stupid question, what's a “timestep” in that context?
The author's own summary from the position paper is:

In particular, we examine the one-step iterative process of DDIM [39, 19, 21] and show that it has limited capacity with respect to the target timestep under the current denoising network design. This can be addressed by adding the target timestep to the inputs of the denoising network [15].

Interestingly, this one fix, plus a proper moment matching objective [5] leads to a stable, single-stage algorithm that surpasses diffusion models in sample quality while being over an order of magnitude more efficient at inference [50]. Notably, these ideas do not rely on denoising score matching [46] or the score-based stochastic differential equations [41] on which the foundations of diffusion models are built.
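
For readers unfamiliar with the term, one common way to express "moment matching" between two sets of samples is a kernel maximum mean discrepancy (MMD) estimate. The sketch below is a generic illustration under that assumption; the quote above does not spell out the exact objective, and the kernel, bandwidth, and function names here are hypothetical choices.

```python
import math


def rbf(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel on scalars; the bandwidth is an arbitrary choice."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def mmd2(xs, ys, kernel=rbf):
    """Biased (V-statistic) estimate of squared MMD between two sample sets."""
    def mean_k(us, vs):
        return sum(kernel(u, v) for u in us for v in vs) / (len(us) * len(vs))

    return mean_k(xs, xs) + mean_k(ys, ys) - 2.0 * mean_k(xs, ys)


if __name__ == "__main__":
    same = [0.0, 0.5, 1.0]
    shifted = [3.0, 3.5, 4.0]
    print(mmd2(same, same))     # identical sample sets
    print(mmd2(same, shifted))  # clearly separated sample sets
```

The estimate is zero when the two sample sets coincide and grows as they move apart, which is what makes it usable as a training loss between model samples and target samples.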

In normal diffusion you train a model to take lots of tiny steps, all the same small size. For example: "You're gonna take 20 steps, at times [1.0, 0.95, 0.90, 0.85...]", and each time the model takes that small fixed-size step to make the image look better.

Here they train a model to say "I'm gonna ask you to take a step from time B to A - might be a small step, might be a big step - but whatever size it is, make the image that much better." You might ask the model to improve the image from t=1.0 to t=0.25 and be almost done. It gets a side variable telling it how much improvement to make in each step.
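
A rough sketch of the sampling interface being described, in Python. Everything here is illustrative: `toy_step` is a hypothetical stand-in for a trained network that receives the target time s as an extra input, and it is exact only for a toy dataset whose points all equal `target` (linear schedule assumed).

```python
def fixed_schedule(n_steps):
    """Evenly spaced times from 1.0 down to 0.0 (20 steps gives size 0.05)."""
    return [1.0 - i / n_steps for i in range(n_steps + 1)]


def sample(step_fn, x, times):
    """Apply step_fn over consecutive (t, s) pairs, e.g. times=[1.0, 0.25, 0.0]."""
    for t, s in zip(times[:-1], times[1:]):
        x = step_fn(x, t, s)
    return x


def toy_step(x, t, s, target=2.0):
    # Hypothetical model: told both the current time t and the target time s,
    # it jumps straight from t to s. Exact only for this one-point toy dataset.
    return [(1.0 - s / t) * target + (s / t) * v for v in x]


if __name__ == "__main__":
    noise = [5.0, -1.0]
    print(sample(toy_step, noise, fixed_schedule(20)))  # many small steps
    print(sample(toy_step, noise, [1.0, 0.25, 0.0]))    # two large steps
```

The only point of the sketch is the signature `step_fn(x, t, s)`: the same function serves both the 20-step schedule and the two-step one, because the step size is an input rather than something baked into the sampler.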

I'm not sure this is right, but that's what I got out of it by skimming the blog & paper.

No, we typically train any diffusion model on a single step (randomly chosen).
The math is totally standard if you've read recent important papers on score matching and flow matching. If you haven't, well, I can't see how you could possibly hope to understand this work at a technical level anyways.