Hacker News
by astrange 1253 days ago
This claims to explain diffusion models from first principles, but the issue with explaining how they work is we don't know how they work.

The explanation in the original paper turns out not to be true; you can get rid of most of their assumptions and it still works: https://arxiv.org/abs/2208.09392

3 comments

> The explanation in the original paper turns out not to be true; you can get rid of most of their assumptions and it still works

I’ll admit it is amusing that some of the assumptions about why it works were incorrect. The core idea of a Markov chain[0] in which each state change leads to higher likelihood is bound to work, even if the rest doesn’t.

In my mind, the Muse paper[1] gets closer to why it works: ultimately, the denoiser tries to match the latent space for an implicit encoder. The Muse system does this more directly and more effectively, by using cross-entropy loss on latent tokens instead.
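To make "cross-entropy loss on latent tokens" concrete, here is a minimal sketch, assuming the image has already been encoded into discrete token ids and that the loss is only taken over masked positions. The function name, codebook size `K`, and the toy inputs are all hypothetical, and the uniform logits stand in for a real model's output; this is not the Muse implementation.

```python
import math

def masked_cross_entropy(logits, targets, mask):
    """Mean cross-entropy over the masked token positions only.

    logits:  list of per-position score lists, one score per codebook entry
    targets: list of ground-truth token ids (ints)
    mask:    list of bools, True where the token was hidden from the model
    """
    total, count = 0.0, 0
    for row, target, is_masked in zip(logits, targets, mask):
        if not is_masked:
            continue
        # Numerically stable log-sum-exp for the softmax normaliser.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[target]  # negative log-probability of the target
        count += 1
    return total / count

K = 16                                   # hypothetical codebook size
targets = [3, 7, 0, 15, 2, 9]            # toy latent token ids
mask = [True, True, False, True, False, False]
uniform_logits = [[0.0] * K for _ in targets]  # stand-in for model output
loss = masked_cross_entropy(uniform_logits, targets, mask)
```

With uniform logits the model is guessing blindly, so the loss equals log(K); training would push it below that.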

In a way, the whole problem is no different from a language translation task. The only difference is that the output needs to be decoded into pixels instead of BPE tokens.

[0]: https://arxiv.org/abs/1503.03585

[1]: https://arxiv.org/abs/2301.00704

Huh, that is quite a fascinating paper: we can learn to invert any image degradation and use it as a generative model? Hm. Is there any research on using some U-net as the degradation function?

Thanks for the link, cold diffusion is a great idea. There were only 2 comments on HN about it four months ago.
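
For anyone wondering what "invert any image degradation" looks like as a loop, here is a rough sketch, assuming a deterministic degradation D(x0, s) and a restoration operator R(x_s, s). The update x_{s-1} = x_s - D(x0_hat, s) + D(x0_hat, s-1) is the improved sampling rule as I understand it from the cold diffusion paper. Everything else is a toy assumption: `degrade` just fades values toward zero, and `restore` is its exact analytic inverse standing in for a trained network.

```python
T = 10  # number of degradation steps (hypothetical)

def degrade(x0, s):
    """Toy deterministic degradation D(x0, s): fade values toward zero."""
    return [v * (1.0 - s / (T + 1)) for v in x0]

def restore(xs, s):
    """Stand-in for the learned restoration operator R(x_s, s).

    Here it is the exact inverse of `degrade`; in practice this would be
    a trained network that only approximates the clean image.
    """
    return [v / (1.0 - s / (T + 1)) for v in xs]

def sample(x_T, steps=T):
    """Walk back from the degraded input using
    x_{s-1} = x_s - D(x0_hat, s) + D(x0_hat, s-1)."""
    x = list(x_T)
    for s in range(steps, 0, -1):
        x0_hat = restore(x, s)            # current guess of the clean image
        d_s = degrade(x0_hat, s)          # re-degrade the guess to level s
        d_prev = degrade(x0_hat, s - 1)   # and to level s-1
        x = [xi - a + b for xi, a, b in zip(x, d_s, d_prev)]
    return x

x0 = [0.2, 0.5, 0.9, 0.1]   # toy "image" as a flat list of pixel values
x_T = degrade(x0, T)        # fully degraded input
recovered = sample(x_T)
```

Because the toy `restore` is a perfect inverse, `recovered` matches `x0` up to floating-point error. With a learned, imperfect restoration operator the loop only approximates the clean image.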