|
|
|
|
|
by jiggawatts
456 days ago
|
|
The learned frequency banks reminded me of a notion I had: Instead of learning upscaling or image generation in pixel space, why not reuse the decades of effort that has gone into lossy image compression by generating output in a psychovisually optimal space? Perhaps frequency space (discrete cosine transform) with a perceptually uniform color space like UCS. This would allow models to be optimised so that they spend more of their compute budget outputting detail that's relevant to human vision. Color spaces that split brightness from chroma would allow increased contrast detail and lower color detail. This is basically what JPG does. |
|
Rather than operate on pixel space directly, they learn to operate on images that have been encoded by a VAE (latents). To generate an image with them, you run the reverse diffusion (actually flow in the case of flux) process they’ve learned and then decode the result using the VAE.
These VAE encoded latent images are 8x smaller in width/height and have 4 channels in the case of Stable Diffusion and 16 in the case of Flux.
I do think it would be more useful if it worked more like you said, though - if the channels weren’t encoded arbitrarily but some of them had pretty clear, useful human meaning like lightness, it would be another hook to control image generation.
To some extent, you can control the existing VAE channels, but it is pretty finicky.