This isn't recommended: decoding takes about as much time as processing the next step. I learned that the hard way when I tried displaying the intermediate steps for debugging.
yeah, running the full decoder takes a while. though, since the "latent" is just 4 channels and pretty close to representing RGB, you can use a linear combination of latent channels and get a basic (grainy, low-res) preview image like this [0] without much trouble. I expect you could go further, and train a shallow conv-only decoder to get nicer preview results, but I'm not sure if anyone's bothered yet.
[0] https://github.com/madebyollin/maple-diffusion
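For anyone who wants to try the linear-combination preview, here's a rough sketch in NumPy. It is not the code from the repo above, just one way to do it under some assumptions: the latent is a (4, H, W) array, the target RGB values live roughly in [-1, 1], and the 4x3 weights get fit by plain least squares from a handful of (latent, downsampled image) pairs you collect yourself. The function names `latent_to_preview` and `fit_preview_weights` are made up for illustration.

```python
import numpy as np


def latent_to_preview(latent, weights, bias=0.0):
    """Map a (4, H, W) latent to an (H, W, 3) uint8 preview.

    Each output pixel is a linear mix of the 4 latent channels at that
    position. `weights` has shape (4, 3): latent channel -> RGB channel.
    """
    rgb = np.einsum("chw,cr->hwr", latent, weights) + bias
    # Assumption: the mixed values land roughly in [-1, 1]; rescale to [0, 1].
    rgb = np.clip((rgb + 1.0) / 2.0, 0.0, 1.0)
    return (rgb * 255).round().astype(np.uint8)


def fit_preview_weights(latents, images):
    """Least-squares fit of the (4, 3) mixing weights.

    latents: (N, 4, H, W) latents from the encoder.
    images:  (N, H, W, 3) RGB targets in [-1, 1], already downsampled
             to the latent resolution.
    """
    x = latents.transpose(0, 2, 3, 1).reshape(-1, 4)  # one row per pixel
    y = images.reshape(-1, 3)
    weights, *_ = np.linalg.lstsq(x, y, rcond=None)
    return weights
```

Once you have the weights, calling `latent_to_preview(latent, weights)` at each step is a single small matrix multiply per pixel, so it costs almost nothing next to a denoising step. The result is at latent resolution, so it'll be small and grainy unless you upscale it.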