Hacker News
by bearobear 462 days ago
Last author here (I also did the DDIM paper, https://arxiv.org/abs/2010.02502). I know this is going to be very tricky math-wise (and in the paper we just wrote the most general thing to make reviewers happy), so I tried to explain the idea more simply in the blog post (https://lumalabs.ai/news/inductive-moment-matching).

If you look at how a single step of the DDIM sampler depends on the target timestep, it is actually just a linear function. This is obviously quite inflexible if we want the sampler to represent a flexible function where we can choose any target timestep. So just add the target timestep as an argument to the neural network and then train it with a moment matching objective.
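To make the linearity point concrete, here is a minimal sketch, not code from the paper. It assumes a linear interpolation schedule (alpha_t = 1 - t, sigma_t = t); the names `ddim_step`, `x0_hat`, and `eps_hat` are illustrative, and a random vector stands in for the network's data prediction. Under those assumptions, the deterministic DDIM update from t to a target timestep s is affine in s, so every target the sampler can reach from a given prediction lies on one straight line.

```python
import numpy as np

def ddim_step(x_t, x0_hat, t, s):
    """Deterministic DDIM step from time t to target time s.

    Assumes the schedule alpha_t = 1 - t, sigma_t = t (an assumption
    made for this sketch; other schedules change the coefficients).
    """
    alpha_t, sigma_t = 1.0 - t, t
    alpha_s, sigma_s = 1.0 - s, s
    # Noise estimate implied by the current sample and the data prediction.
    eps_hat = (x_t - alpha_t * x0_hat) / sigma_t
    # Re-noise the data prediction to the target noise level.
    return alpha_s * x0_hat + sigma_s * eps_hat

rng = np.random.default_rng(0)
x_t = rng.normal(size=4)      # current noisy sample
x0_hat = rng.normal(size=4)   # stand-in for the network's data prediction
t = 0.8

a = ddim_step(x_t, x0_hat, t, 0.2)
b = ddim_step(x_t, x0_hat, t, 0.6)
mid = ddim_step(x_t, x0_hat, t, 0.4)

# With the prediction held fixed, the output at s = 0.4 is exactly the
# midpoint of the outputs at s = 0.2 and s = 0.6.
is_affine = np.allclose(mid, 0.5 * (a + b))
```

The only way s enters is through the two scalar coefficients, which is the inflexibility described above. Passing s to the network as an extra input removes that restriction.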

In general, I feel that analyzing a method's inference-time properties before training it can be helpful not only for diffusion models but also for LLMs, including various recent diffusion LLMs. This prompted me to write a position paper in the hope that others develop cool new ideas (https://arxiv.org/abs/2503.07154).

3 comments

Just as a counter perspective: I think your paper is great!

Please don’t let people ever discourage you from writing proper papers. Ever since Meta etc. started asking for “2 papers in relevant fields”, we see a flood of papers that should be tweets.

What happens if we don't add any moment matching objective? E.g., at train time, just fit a diffusion model that predicts the target given any pair of timesteps (t, t')? Why is moment matching critical here?

Also, regarding linearity, why is it inflexible? It seems quite convenient that a simple linear interpolation is used for reconstruction. Besides, even in DDIM, the direction towards the final target changes at each step as the images become less noisy. In standard diffusion models, or even flow matching, the denoised sample is always the prediction of the original data plus a direction from the current timestep t to the target timestep t'. Just to be clear, it is intuitive that such models are inferior in few-step generation, since they don't optimise for test-time efficiency (in terms of the tradeoff of quality vs. compute), but it's unclear what inflexibility exists there beyond this limitation.

Presumably there's no expected benefit in quality if all timesteps are used in denoising?

Stupid question: what's a “timestep” in that context?