by mturnshek
454 days ago
You may already know this, but image generators like Stable Diffusion and Flux already do this in the form of “latent diffusion”. Rather than operating on pixel space directly, they learn to operate on images that have been encoded by a VAE (latents). To generate an image, you run the reverse diffusion process they’ve learned (actually a flow, in the case of Flux) and then decode the result with the VAE. These VAE-encoded latents are 8x smaller in width and height, and have 4 channels in the case of Stable Diffusion and 16 in the case of Flux. I do think it would be more useful if it worked more like you said, though - if the channels weren’t encoded arbitrarily, and some of them had a clear, human-meaningful interpretation like lightness, that would be another hook to control image generation. To some extent you can control the existing VAE channels, but it is pretty finicky.
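To make the sizes concrete, here is a rough sketch of the shape arithmetic, using only the numbers above (8x downsampling per side, 4 latent channels for Stable Diffusion, 16 for Flux). The names `LatentSpec`, `latent_shape`, and `compression_ratio` are made up for illustration and don't come from any library, and I'm assuming the image dimensions are multiples of 8.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LatentSpec:
    """Hypothetical description of a VAE latent space (not a real library type)."""
    name: str
    downsample: int  # reduction factor per spatial side
    channels: int    # number of latent channels


# Figures taken from the comment above.
SD = LatentSpec("Stable Diffusion", downsample=8, channels=4)
FLUX = LatentSpec("Flux", downsample=8, channels=16)


def latent_shape(spec: LatentSpec, height: int, width: int) -> tuple[int, int, int]:
    """Return (channels, height, width) of the latent for an image of the given size."""
    if height % spec.downsample or width % spec.downsample:
        raise ValueError(f"image size must be a multiple of {spec.downsample}")
    return (spec.channels, height // spec.downsample, width // spec.downsample)


def compression_ratio(spec: LatentSpec, height: int, width: int, rgb_channels: int = 3) -> float:
    """Ratio of pixel-space values to latent-space values (counts values, not bytes)."""
    c, h, w = latent_shape(spec, height, width)
    return (rgb_channels * height * width) / (c * h * w)


if __name__ == "__main__":
    for spec, size in ((SD, 512), (FLUX, 1024)):
        print(spec.name, latent_shape(spec, size, size), compression_ratio(spec, size, size))
```

So a 512x512 RGB image becomes a 4x64x64 latent under these assumptions, which is 48x fewer values than pixel space. With 16 channels the ratio drops to 12x, since the spatial downsampling is the same but each latent position carries more channels.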