| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by mturnshek 501 days ago

You may already know this, but image generators like Stable Diffusion and Flux already do this in the form of “latent diffusion”.

Rather than operate on pixel space directly, they learn to operate on images that have been encoded by a VAE (latents). To generate an image with them, you run the reverse diffusion (actually flow in the case of flux) process they’ve learned and then decode the result using the VAE.

These VAE encoded latent images are 8x smaller in width/height and have 4 channels in the case of Stable Diffusion and 16 in the case of Flux.

I do think it would be more useful if it worked more like you said, though - if the channels weren’t encoded arbitrarily but some of them had pretty clear, useful human meaning like lightness, it would be another hook to control image generation.

To some extent, you can control the existing VAE channels, but it is pretty finicky.

2 comments

sigmoid10 501 days ago

If there's one thing that neural networks have shown, it's that they are much better at picking up encoding patterns for realistic tasks than humans. There are so many aspects that could be used in dimensional reduction tasks that it seems pretty wild that we've come this far with human-designed patterns. From a top down engineering perspective, it might seem like a disadvantage to have algorithms that are not tailored to particular cases. But when you want things like general purpose image generation, it's simply much more economical to let ML figure out which dimensions to focus on. Because humans would spend years coming up with the details of certain formats and still not cover half the cases.

link

orbital-decay 500 days ago

>You may already know this, but image generators like Stable Diffusion and Flux already do this in the form of “latent diffusion”.

They... don't. Latents don't meaningfully represent human perception, they represent correlations in the dataset. Parent is talking about the function aligned with actual measured human perception (UCS is an example of that). Whether it's a good idea, and how trivial it is for the model to fit this function automatically, is another question.

link