by gliptic 617 days ago
> I _think_ I understand the basic premise behind stable diffusion, i.e., reverse the denoising process to generate realistic images but, as far as I know, this is always done at the pixel level.

It's typically not done at the pixel level, but in the "latent space" of e.g. a VAE (variational autoencoder). The image generation happens in this space, which has far fewer values than the final image has pixels, and the result is then converted to pixels using the VAE's decoder.
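To put rough numbers on "fewer outputs", here is a back-of-the-envelope sketch. The shapes are my assumption, not something stated in this thread: the commonly cited Stable Diffusion v1 setup of a 512x512 RGB image, a VAE that downsamples each side by 8, and 4 latent channels. Other models and resolutions will differ.

```python
# Assumed shapes for Stable Diffusion v1 (not stated in the thread):
# 512x512 RGB output, VAE downsamples each side by 8, latent has 4 channels.
IMAGE_SIDE = 512
IMAGE_CHANNELS = 3
DOWNSAMPLE_FACTOR = 8
LATENT_CHANNELS = 4


def element_count(side: int, channels: int) -> int:
    """Number of values in a square (side x side) tensor with `channels` channels."""
    return side * side * channels


latent_side = IMAGE_SIDE // DOWNSAMPLE_FACTOR
pixel_values = element_count(IMAGE_SIDE, IMAGE_CHANNELS)
latent_values = element_count(latent_side, LATENT_CHANNELS)

print(f"pixel space:  {IMAGE_SIDE}x{IMAGE_SIDE}x{IMAGE_CHANNELS} = {pixel_values} values")
print(f"latent space: {latent_side}x{latent_side}x{LATENT_CHANNELS} = {latent_values} values")
print(f"the denoiser handles {pixel_values // latent_values}x fewer values per step")
```

Under those assumptions, every denoising step touches a small fraction of the values it would at the pixel level, and the VAE decoder only runs once at the end.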


Frantically Googles VAE...

Ah, okay, so the work is done at a different level of abstraction, didn't know that. But I guess it's still a pixel-related abstraction, and it is converted back to pixels to generate the final image?

I suppose in my proposed (and probably implausible) algorithm, that different level of abstraction might be loosely analogous to collections of related game engine assets that are often used together, so that the denoising algorithm might effectively be saying things like "we'll put some building-related assets here-ish, and some park-related flora assets over here...", and then that gets crystallised into actual placement of individual assets in the post-processing step.

(High level, specifics are definitely wrong here)

The VAE isn't really pixel-level, it's semantic-level. The most significant bits in the encoding are like "how light or dark is the image", and towards the other end the bits represent more niche things like "if it's an image of a person, make them wear glasses". This is way more efficient than working on raw pixels: the encoding is so heavily compressed that there's much less data to process. This was one of the big breakthroughs of Stable Diffusion compared to previous efforts like Disco Diffusion, which work at the pixel level.

The VAE encodes and decodes images automatically. It's not something that's hand-written; it's trained to understand the semantics of images in the same way other neural nets are.