Hacker News new | ask | show | jobs
by jmchambers 618 days ago
Frantically Googles VAE...

Ah, okay, so the work is done at a different level of abstraction, didn't know that. But I guess it's still a pixel-related abstraction, and it is converted back to pixels to generate the final image?

I suppose in my proposed (and probably implausible) algorithm, that different level of abstraction might be loosely analogous to collections of related game engine assets that are often used together, so that the denoising algorithm might be effectively saying things like "we'll put some building-related assets here-ish, and some park-related flora assets over here...", and then that gets crystallised in to actual placement of individual assets in the post-processing step.

1 comments

(High level, specifics are definitely wrong here)

The VAE isn't really pixel-level, it's semantic-level. The most significant bits in the encoding are like "how light or dark is the image" and then towards the other end bits represent more niche things like "if it's an image of a person, make them wear glasses". This is way more efficient than using raw pixels because it's so heavily compressed, there's less data. This was one of the big breakthroughs of stable diffusion compared to previous efforts like disco diffusion that work on the pixel level.

The VAE encodes and decodes images automatically. It's not something that's written, it's trained to understand the semantics of the images in the same way other neural nets are.