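The contrast here is real: there are pixel-space diffusion models and latent-space diffusion models. Pixel-space diffusion is slower because it has to model every pixel, and most of that information is redundant; latent-space diffusion works on a compressed representation instead.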
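To put a rough number on that redundancy, here is a back-of-the-envelope sketch. The sizes are assumptions picked for illustration (a 512x512 RGB image, and an autoencoder that downsamples 8x per side into 4 latent channels), not figures from any particular model discussed here.

```python
def num_values(height, width, channels):
    """Number of scalar values the denoiser has to process per image."""
    return height * width * channels

# Assumed example sizes, for illustration only.
IMAGE_SIDE = 512
DOWNSAMPLE = 8        # assumed autoencoder downsampling factor per side
LATENT_CHANNELS = 4   # assumed number of latent channels

pixel_values = num_values(IMAGE_SIDE, IMAGE_SIDE, 3)
latent_values = num_values(IMAGE_SIDE // DOWNSAMPLE,
                           IMAGE_SIDE // DOWNSAMPLE,
                           LATENT_CHANNELS)

print(pixel_values)                   # 786432
print(latent_values)                  # 16384
print(pixel_values / latent_values)   # 48.0
```

Under those assumptions the latent model touches about 48x fewer values per denoising step, which is roughly where the speed difference comes from.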
The most popular way to do autoregressive image generation is to predict image patches/tokens rather than pixels, though that still scales worse than diffusion.
A fairly new but promising approach for autoregression that seems to scale as well as diffusion is predicting the next image scale/resolution rather than the next image patch.
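A quick way to see why the choice of prediction unit matters is to count sequential steps. This is only a sketch: the 256x256 image, the 16-pixel patch size, and the list of scale sizes are all assumed for illustration, and real systems differ in the details.

```python
def pixel_steps(height, width, channels=3):
    """One sequential step per colour value, pixel by pixel."""
    return height * width * channels

def token_steps(height, width, patch):
    """One sequential step per patch/token on a (height/patch) x (width/patch) grid."""
    return (height // patch) * (width // patch)

def scale_steps(sides):
    """One sequential step per scale; all tokens within a scale are predicted together."""
    return len(sides)

def scale_tokens(sides):
    """Total tokens produced across all scales (each scale is a side x side grid)."""
    return sum(side * side for side in sides)

# Assumed coarse-to-fine schedule of token-grid side lengths, illustration only.
ASSUMED_SIDES = (1, 2, 3, 4, 5, 6, 8, 10, 13, 16)

print(pixel_steps(256, 256))        # 196608
print(token_steps(256, 256, 16))    # 256
print(scale_steps(ASSUMED_SIDES))   # 10
print(scale_tokens(ASSUMED_SIDES))  # 680
```

Next-scale prediction produces more tokens in total under these assumptions, but it needs far fewer sequential passes, because each pass fills in a whole resolution at once.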
However, diffusion models suck at details, like how many fingers are on a hand, and with language the individual words and characters matter: both which ones they are and where they sit.
So while I'm sure diffusion could produce walls of text that, at a glance, look convincingly like a blog post, I'm not sure they would hold up to anyone actually reading them.
Sequential, pixel-by-pixel image generation was state of the art in 2016, and it's basically how current LLMs work, one token at a time:
https://arxiv.org/abs/1601.06759
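
For anyone unfamiliar with what "sequential generation" means in practice, here is a minimal toy sketch of the sampling loop. The function toy_next_distribution is a made-up stand-in for a trained network (it just favours repeating the previous value); it is not the model from the linked paper or from any LLM.

```python
import random

def toy_next_distribution(prefix, vocab_size):
    """Hypothetical stand-in for a trained model: returns probabilities for the
    next value given everything generated so far. It simply biases toward
    repeating the last value."""
    weights = [1.0] * vocab_size
    if prefix:
        weights[prefix[-1]] += vocab_size
    total = sum(weights)
    return [w / total for w in weights]

def sample_sequence(length, vocab_size, seed=0):
    """Generate values one at a time, each conditioned on the prefix so far."""
    rng = random.Random(seed)
    sequence = []
    for _ in range(length):
        probs = toy_next_distribution(sequence, vocab_size)
        next_value = rng.choices(range(vocab_size), weights=probs, k=1)[0]
        sequence.append(next_value)  # the new value becomes part of the context
    return sequence

print(sample_sequence(16, 4, seed=1))
```

The shape of the loop is the same whether the values are pixel intensities or text tokens: predict a distribution from the prefix, sample one value, append it, repeat.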