Hacker News new | ask | show | jobs
by cubefox 401 days ago
It's nice that this contains a comparison between diffusion models that are used for image models, and the autoregressive models that are used for LLMs.

But recently (2024 NeuIPS paper of the year) there was a new paper on autoregressive image modelling that apparently outperforms diffusion models: https://arxiv.org/abs/2404.02905

The innovation is that it doesn't predict image patches (like older autoregressive image models) but somehow does some sort of "next scale" or "next resolution" prediction.

In the past, autoregressive image models did not perform as well as diffusion models, which meant that most image models used diffusion. Now it seems autoregressive techniques have a strict advantage over diffusion models. Another advantage is that they can be integrated with autoregressive LLMs (multimodality), which is not possible with diffusion image models. In fact, the recent GPT-4o image generation is autoregressive according to OpenAI. I wonder whether diffusion models still have a future now.

2 comments

From what I can tell, it doesn't look like the recent GPT-4o image generation includes the research of the NeurIPS paper you cited. If it did, we wouldn't see a line-by-line generation of the image, which we do currently in GPT-4o, but rather a decoding similar to progressive JPEG.

I'm not 100% convinced that diffusion models are dead. That paper fixes autoregression for 2D spaces by basically turning the generation problem from pixel-by-pixel to iterative upsampling, but if 2D was the problem (and 1D was not), why don't we have more autoregressive models in 1D spaces like audio?

> From what I can tell, it doesn't look like the recent GPT-4o image generation includes the research of the NeurIPS paper you cited. If it did, we wouldn't see a line-by-line generation of the image, which we do currently in GPT-4o, but rather a decoding similar to progressive JPEG.

Going off my bad memory, but I think I remember a comment saying the line-by-line generation was just a visual effect.

>From what I can tell, it doesn't look like the recent GPT-4o image generation includes the research of the NeurIPS paper you cited. If it did, we wouldn't see a line-by-line generation of the image, which we do currently in GPT-4o, but rather a decoding similar to progressive JPEG.

You could, because it's still autoregressive. It still generates patches left to right, top to bottom. It's just that we're not starting with patches at the target resolution.

> From what I can tell, it doesn't look like the recent GPT-4o image generation includes the research of the NeurIPS paper you cited.

Which means autoregressive image models are even ahead of diffusion on multiple fronts, i.e. both in whatever GPT-4o is doing and in the method described in the VAR paper.

>The innovation is that it doesn't predict image patches (like older autoregressive image models) but somehow does some sort of "next scale" or "next resolution" prediction.

It still predicts image patches, left to right and top to bottom. The main difference is that you start with patches at a low resolution.