|
|
|
|
|
by erwannmillon
1049 days ago
|
|
Depends, given the low res, the 3x64x64 pixel space image is smaller than the latents you would get from encoding a higher-res image with models like VQGAN or the stablediff VAE at their native resolutions. It's easier to get a sense of what's going wrong with a pixel space model though. With latent space, there's always the question of how color is represented in latent space / how entangled it is with other structure / semantics. Starting in pixel space removed a lot of variables from the equation, but latent diffusion is the obvious next step |
|