by zozbot234 1 day ago
Edge devices don't just have limited memory bandwidth, though; they also have very limited compute, to the extent that you don't actually need all that much batching to saturate their available compute and run into obvious thermal/power limits. (It's just not true that "requests are inherently serial" in edge inference; any time you have multiple requests (i.e. "chats") in flight, batching becomes applicable, provided you have enough memory capacity for the KV caches.) I'm not sure how diffusion models are supposed to help there if they simply take more compute for lower-quality outcomes and a dubious saving in memory bandwidth.
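To put rough numbers on the "not much batching needed" point, here is a back-of-the-envelope roofline sketch. It assumes decode reads every weight once per step and spends about 2 FLOPs per parameter per sequence, and it ignores KV-cache traffic and attention FLOPs. The device figures are hypothetical placeholders, not measurements of any real chip.

```python
def saturating_batch(peak_flops, mem_bandwidth, bytes_per_param):
    """Batch size at which a decode step stops being bandwidth-bound.

    Per step the weights are streamed once: time_mem = P * bytes_per_param / BW.
    Each sequence costs ~2 FLOPs per parameter: time_compute = 2 * P * B / F.
    Setting the two equal gives B = F * bytes_per_param / (2 * BW);
    the parameter count P cancels out.
    """
    return peak_flops * bytes_per_param / (2.0 * mem_bandwidth)


# Hypothetical edge NPU: 4 TFLOP/s, 100 GB/s, 4-bit weights (0.5 bytes/param).
edge = saturating_batch(peak_flops=4e12, mem_bandwidth=100e9, bytes_per_param=0.5)

# Hypothetical datacenter GPU: 1000 TFLOP/s, 3000 GB/s, 8-bit weights.
datacenter = saturating_batch(peak_flops=1000e12, mem_bandwidth=3000e9, bytes_per_param=1.0)

print(f"edge crossover batch:       {edge:.0f}")
print(f"datacenter crossover batch: {datacenter:.0f}")
```

Under these assumed figures the edge device is compute-bound at a batch of about 10, while the datacenter part needs well over 100 concurrent sequences, which is the asymmetry the comment is pointing at.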

Forgot to mention it previously, but this might be a good model for a narrow slice of midrange systems that really are more skewed towards compute than memory bandwidth, but also don't have enough memory capacity to batch effectively (e.g. top-of-the-range consumer GPUs, or earlier generations of datacenter GPUs). You do also compete with things like MTP (multi-token prediction) there, which targets a similar tradeoff, or with denser models that have a similar total parameter count. So I'd say that the jury is very much still out, even in that narrow space. Diffusion models are also apparently very hard to scale to hundreds of billions or trillions of parameters, since the way you train them is completely different from the usual one-token-at-a-time models.
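A rough sketch of why memory capacity can cap batching on that kind of midrange card: once the weights are resident, the leftover memory bounds how many KV caches fit. The model shape and the 24 GB / 16 GB figures below are hypothetical, chosen only to illustrate the arithmetic, and the estimate ignores activations and allocator overhead.

```python
def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem):
    # One key vector and one value vector per layer per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_batch_by_capacity(capacity_bytes, weight_bytes, kv_per_token, context_len):
    # Memory left after the weights, divided by one full-length KV cache.
    free_bytes = capacity_bytes - weight_bytes
    return max(free_bytes // (kv_per_token * context_len), 0)


# Hypothetical model: 32 layers, 8 KV heads, head_dim 128, fp16 KV cache.
kv = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2)

# Hypothetical consumer GPU: 24 GB total, 16 GB of weights, 8192-token contexts.
batch = max_batch_by_capacity(
    capacity_bytes=24 * 10**9,
    weight_bytes=16 * 10**9,
    kv_per_token=kv,
    context_len=8192,
)

print(f"KV cache per token: {kv} bytes")
print(f"max concurrent sequences: {batch}")
```

With these assumed numbers only a handful of full-length sequences fit, far short of the batch a compute-heavy GPU would need to saturate its ALUs, so that is the slice where spending extra compute per request could plausibly pay off.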