by samuelknight 13 days ago
Some of these comments miss the advantage of diffusion. This will have a big impact on edge devices, such as your phone or the GPU in your computer.

An LLM's decoder computes tokens one at a time because attention has to account for each previous token. Existing LLM decoders scale well when you have enough load to batch many inferences together. Diffusion is of limited benefit there. On edge you have a different problem: your inference accelerator is starved while sloshing GBs of weights back and forth from RAM. That's because consumer RAM like LPDDRx/GDDRx has lower bandwidth than HBM, and the requests are serial, so you can't batch compute over common weights. Diffusion can compute tokens in parallel, which relieves the memory bandwidth bottleneck.
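
A back-of-the-envelope sketch of the bandwidth argument above. The numbers (8 GB of weights, 100 GB/s of memory bandwidth, 16 tokens per pass) are hypothetical, and the model is deliberately crude: it assumes every forward pass streams all the weights from RAM exactly once, and it ignores KV-cache traffic and compute limits. A diffusion model also needs several denoising passes per block, so `tokens_per_pass` here means the effective number of tokens finalized per pass.

```python
def tokens_per_second(weight_bytes, bandwidth_bytes_per_s, tokens_per_pass=1):
    """Upper bound on decode speed when each pass streams all weights once."""
    passes_per_second = bandwidth_bytes_per_s / weight_bytes
    return passes_per_second * tokens_per_pass

GB = 1e9
weights = 8 * GB       # hypothetical: 8B parameters at 1 byte each
bandwidth = 100 * GB   # hypothetical: LPDDR-class memory bandwidth per second

# One token per pass (serial decode, batch size 1).
serial = tokens_per_second(weights, bandwidth)
# Hypothetical: 16 tokens finalized per pass on average.
parallel = tokens_per_second(weights, bandwidth, tokens_per_pass=16)

print(f"serial:   {serial:.1f} tokens/s")
print(f"parallel: {parallel:.1f} tokens/s")
```

Under these assumptions the bound scales linearly with tokens per pass, because the cost of reading the weights is shared across every token produced in that pass.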

2 comments

Edge devices don't just have limited memory bandwidth though, they also have very limited compute, to the extent that you don't actually need all that much batching to saturate their viable compute and run into obvious thermal/power limits. (It's just not true that "requests are inherently serial" in edge inference; any time you have multiple requests (i.e. "chats") in flight, batching becomes applicable if you have enough memory capacity for the KV caches.) I'm not sure how diffusion models are supposed to help there if they simply take more compute for lower-quality outcomes and a dubious saving in memory bandwidth.
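
To put a rough number on "not all that much batching": a simple roofline-style sketch, assuming a dense model that costs about 2 FLOPs per parameter per token and streams its weights once per pass. The hardware figures (10 TFLOP/s, 100 GB/s) and model size are made up for illustration, and KV-cache reads and thermal throttling are ignored.

```python
def step_time(params, bytes_per_param, tokens, bandwidth, flops):
    """Seconds per forward pass: whichever of memory or compute is slower."""
    memory_time = params * bytes_per_param / bandwidth
    compute_time = 2 * params * tokens / flops  # ~2 FLOPs per param per token
    return max(memory_time, compute_time)

def break_even_tokens(bytes_per_param, bandwidth, flops):
    """Tokens per pass at which compute time catches up with memory time."""
    return flops * bytes_per_param / (2 * bandwidth)

PARAMS = 8e9        # hypothetical 8B-parameter dense model
BYTES = 1           # hypothetical 8-bit weights
BANDWIDTH = 100e9   # hypothetical 100 GB/s
FLOPS = 10e12       # hypothetical 10 TFLOP/s of usable compute

crossover = break_even_tokens(BYTES, BANDWIDTH, FLOPS)
print(f"compute saturates at about {crossover:.0f} tokens per pass")

for tokens in (1, 50, 200):
    t = step_time(PARAMS, BYTES, tokens, BANDWIDTH, FLOPS)
    print(f"{tokens:>3} tokens/pass: {t * 1000:.0f} ms per pass")
```

Below the crossover, extra tokens per pass are nearly free; above it, each pass gets proportionally slower, so further batching or parallel decoding stops helping.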
Forgot to mention it previously, but this might be a good model for a narrow slice of midrange systems that really are more skewed towards compute than memory bandwidth, but also don't have enough memory capacity to use batching effectively. (E.g. top-of-the-range consumer GPUs, or earlier generations of datacenter GPUs.) Although there you also compete with things like MTP, which targets a similar tradeoff, or with denser models that have a similar number of total parameters. So I'd say that the jury is very much still out, even in that narrow space. Diffusion models are also apparently very hard to scale to hundred-billion or trillion parameter counts, since the way you train them is completely different from the usual one-token-at-a-time models.
You’re mostly right, but you're conflating attention with autoregressive/causal decoding, which is the real issue that prevents you from using more compute.

You can use diffusion with attention, and this model does in fact use attention.
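
To make the distinction concrete, here is a toy sketch of the two attention patterns. It is only an illustration, not how this model is implemented: `attention_mask` is a hypothetical helper, and a real model would apply the mask to attention scores rather than return booleans. Both variants are attention; the causal one is what forces left-to-right, one-token-at-a-time decoding.

```python
def attention_mask(n, causal):
    """mask[i][j] is True when position i may attend to position j."""
    return [[(j <= i) if causal else True for j in range(n)] for i in range(n)]

def show(mask):
    # Render the mask as rows of 'x' (may attend) and '.' (blocked).
    return "\n".join("".join("x" if ok else "." for ok in row) for row in mask)

causal = attention_mask(4, causal=True)          # autoregressive decoder
bidirectional = attention_mask(4, causal=False)  # every position sees every other

print("causal:")
print(show(causal))
print("bidirectional:")
print(show(bidirectional))
```

With the causal mask, position i depends only on positions 0..i, so token i+1 cannot be produced until token i exists. With the full mask, all positions in a block can be updated in the same pass.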

Yes, I should have said autoregressive.