| HN Mirror



	by kleiba 616 days ago
	Why?


WithinReason 616 days ago

Diffusion works significantly better for images than sequential pixel generation, so there is a good chance it would work better for language as well.

Sequential generation was the state of the art for images in 2016, and it's basically how current LLMs work:

https://arxiv.org/abs/1601.06759
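
To make "sequential generation" concrete, here is a minimal sketch of the sampling loop: each new token is drawn conditioned on everything generated so far. This is only an illustration, not the method from the linked paper; `toy_model` and its lookup table are hypothetical stand-ins for a trained network, and `sample_sequentially` is a name made up for this example.

```python
import random


def sample_sequentially(next_token_probs, start, length, seed=0):
    """Generate tokens one at a time, each conditioned on the prefix so far."""
    rng = random.Random(seed)
    tokens = [start]
    for _ in range(length - 1):
        # The (hypothetical) model maps the prefix to {token: probability}.
        probs = next_token_probs(tokens)
        choices, weights = zip(*probs.items())
        tokens.append(rng.choices(choices, weights=weights, k=1)[0])
    return tokens


def toy_model(prefix):
    # Toy stand-in for a trained network: it only looks at the last token.
    table = {
        "the": {"cat": 0.5, "dog": 0.5},
        "cat": {"sat": 1.0},
        "dog": {"sat": 1.0},
        "sat": {"the": 1.0},
    }
    return table[prefix[-1]]


print(sample_sequentially(toy_model, "the", 6))
```

The same loop applies whether the "tokens" are words or pixels. A diffusion model would instead start from noise over the whole output and refine all positions together over several steps.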

kleiba 616 days ago

Neural LMs used to be based on recurrent architectures until the Transformer came along. That architecture is not recurrent.

I am not sure that a diffusion approach is all that suitable for generating language. Words are much more discrete than pixels.

WithinReason 616 days ago

I meant sequential generation; I didn't mean using an RNN.

Diffusion doesn't work on pixels directly either; it works on a latent representation.

kleiba 616 days ago

All NNs work on latent representations.

barrkel 616 days ago

The contrast here is real: there are pixel space diffusion models and latent space diffusion models. Pixel space diffusion is slower because there's more redundant information.
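
As a rough back-of-the-envelope illustration of that redundancy, the sketch below counts how many values the model has to denoise in each case. The image size and the autoencoder shape (8x spatial downsampling, 4 latent channels) are assumptions picked for the example, not figures from this thread.

```python
def num_values(height, width, channels):
    """Number of scalar values in a height x width x channels tensor."""
    return height * width * channels


# Assumed example: a 512x512 RGB image.
pixel_values = num_values(512, 512, 3)

# Assumed autoencoder: 8x downsampling per side, 4 latent channels.
latent_values = num_values(512 // 8, 512 // 8, 4)

print(pixel_values, latent_values, pixel_values / latent_values)
```

Under these assumptions the latent has 48 times fewer values than the raw pixels, which is one reason each denoising step is cheaper in latent space.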

famouswaffles 615 days ago

The most popular autoregressive method in the image generation space is to predict image patches/tokens rather than pixels, though that still scales worse than diffusion.

A fairly new but promising approach for autoregression that seems to scale as well as diffusion is predicting the next image scale/resolution rather than the next image patch.

https://arxiv.org/abs/2404.02905

magicalhippo 616 days ago

I had similar thoughts to you.

However, diffusion models suck at details, like how many fingers are on a hand, and with language the words and characters matter: both which ones they are and where they are.

So while I'm sure diffusion could produce walls of text that look convincingly like, say, a blog post at a glance, I'm not sure it would hold up to anyone actually reading it.