Forgot to mention it previously, but this might be a good model for a narrow slice of midrange systems that really are skewed more towards compute than memory bandwidth, but also don't have enough memory capacity to use batching effectively (e.g. top-of-the-range consumer GPUs, or earlier generations of datacenter GPUs). There you also compete with things like MTP (multi-token prediction), which targets a similar tradeoff, or with denser models featuring a similar number of total parameters. So I'd say the jury is still very much out, even in that narrow space. Diffusion models are also apparently very hard to scale to a hundred billion or a trillion parameters, since the way you train them is completely different from the usual one-token-at-a-time models.
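
To put rough numbers on the compute-vs-bandwidth point, here's a back-of-the-envelope roofline sketch. Everything in it is an assumption for illustration: the function names are made up, the hardware figures (100 TFLOP/s, 1 TB/s) and the 8B-parameter, 2-bytes-per-weight model are hypothetical round numbers rather than any real GPU or model, and it ignores KV cache traffic, activations, and the fact that not every token produced in parallel ends up being kept.

```python
def pass_time_s(params, bytes_per_param, tokens_per_pass, flops_per_s, bytes_per_s):
    """Roofline estimate for one forward pass at batch size 1.

    Assumes ~2 FLOPs per parameter per token and that every weight is
    streamed from memory exactly once per pass (both are simplifications).
    """
    compute_s = 2 * params * tokens_per_pass / flops_per_s
    memory_s = params * bytes_per_param / bytes_per_s
    # The pass is limited by whichever resource takes longer.
    return max(compute_s, memory_s)


def tokens_per_second(params, bytes_per_param, tokens_per_pass, flops_per_s, bytes_per_s):
    """Idealized throughput: treats every token in a pass as useful output."""
    t = pass_time_s(params, bytes_per_param, tokens_per_pass, flops_per_s, bytes_per_s)
    return tokens_per_pass / t


def break_even_tokens(bytes_per_param, flops_per_s, bytes_per_s):
    """Tokens per pass at which compute time equals weight-streaming time."""
    return bytes_per_param * flops_per_s / (2 * bytes_per_s)


if __name__ == "__main__":
    # Hypothetical hardware and model, chosen only to make the arithmetic easy.
    FLOPS = 100e12   # 100 TFLOP/s
    BW = 1e12        # 1 TB/s
    PARAMS = 8e9     # 8B parameters
    BYTES = 2        # 16-bit weights

    print("break-even tokens per pass:", break_even_tokens(BYTES, FLOPS, BW))
    for k in (1, 8, 100, 200):
        print(k, "tokens/pass:", tokens_per_second(PARAMS, BYTES, k, FLOPS, BW), "tok/s")
```

Under these assumptions, one-token-at-a-time decoding leaves most of the compute idle, and producing several tokens per pass (which is roughly what both diffusion and MTP are going after) gets throughput back without needing the memory for a big batch. Past the break-even point you're compute-bound and extra tokens per pass stop helping.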