Hacker News new | ask | show | jobs
by carbocation 1048 days ago
> I did this in pixel space for simplicity, cool animations, and compute costs

A slight nitpick, wouldn't doing diffusion in the latent space be cheaper?

2 comments

Depends, given the low res, the 3x64x64 pixel space image is smaller than the latents you would get from encoding a higher-res image with models like VQGAN or the stablediff VAE at their native resolutions.

It's easier to get a sense of what's going wrong with a pixel space model though. With latent space, there's always the question of how color is represented in latent space / how entangled it is with other structure / semantics.

Starting in pixel space removed a lot of variables from the equation, but latent diffusion is the obvious next step

Not necessarily if you don’t already have a pretrained autoencoder.