HN Mirror

by whilefalse 110 days ago

Hey, I made this, thanks for posting!

It’s purposefully high level and non-technical for a general audience - my theory was that most people who aren’t into tech/AI don’t care too much about training, or how the system got to be the way that it is.

But they do have some interest in how it actually operates once you’ve typed in a prompt.

Happy to answer any questions or take on board feedback.

5 comments

in-silico 107 days ago

I think some of the visualizations would be much better if you used a pixel-space model instead of a latent diffusion model.

Right now we are only seeing the denoising process after it's been morphed by the latent decoder, which looks a lot less intuitive than actual pixel diffusion.

If you can't find a suitable pixel-space model, then you can just trivially generate a forward process and play it backwards.
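A minimal sketch of the "generate a forward process and play it backwards" idea, assuming a DDPM-style linear noise schedule and one fixed noise sample so the frames change smoothly. Both choices are assumptions made for illustration, and the names `forward_frames` and `reversed_for_display` are hypothetical. Note that this only replays the noising of a known image in reverse order; no model denoises anything here.

```python
import numpy as np


def forward_frames(image, num_steps=1000, keep_every=20,
                   beta_start=1e-4, beta_end=0.02, seed=0):
    """Noise an image step by step (closed-form forward process).

    `image` is a float array with values in [0, 1].
    Returns a list of frames in [-1, 1] space, from clean to almost pure noise.
    """
    rng = np.random.default_rng(seed)
    x0 = image.astype(np.float64) * 2.0 - 1.0      # rescale [0, 1] -> [-1, 1]

    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)            # cumulative signal fraction

    # Assumption: reuse one noise sample for every step so that playback
    # looks like a smooth fade rather than flickering static.
    eps = rng.standard_normal(x0.shape)

    frames = [x0]
    for t in range(keep_every - 1, num_steps, keep_every):
        a = alpha_bar[t]
        # x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps
        frames.append(np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps)
    return frames


def reversed_for_display(frames):
    """Reverse the frame order and map back to [0, 1] for display."""
    return [np.clip((f + 1.0) / 2.0, 0.0, 1.0) for f in reversed(frames)]
```

Showing the result of `reversed_for_display(forward_frames(img))` frame by frame starts at near-pure noise and ends on the original image, which gives the pixel-space look without needing a pixel-space model.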

whilefalse 107 days ago

Thanks, that’s a great suggestion.

socalgal2 107 days ago

Thanks for this!

Has there been any study of grammar and other word order effects in the result? Is "Dog fetches ball with tail" more likely to produce an image of a dog with a ball grabbed with its tail than "tail ball dog fetch with"?

As with search engines, ambiguity is an issue: if a user searches for "best price on windows", do they mean Windows the OS or glass windows?

My impression, at least with the image generators I've used, is that while there is some mapping of words and maybe phrases through the latent space to an image, it's very weak. If you put "red ball" in a long prompt, it's nearly as likely that "red" will get applied to some other part of the description as to the ball.

whilefalse 107 days ago

Honestly, I don’t know the answer to that, but it’s a good question and something interesting to look into. The PRX model I used ran pretty well on my MacBook M4, so you could play around with it, although I guess the results will depend on the specifics of the model.

When I was building this I did have to rework the prompts quite a bit so they worked nicely with the word-by-word reveal visualisation, i.e. they mention the subject early, then add adjectives about setting and light etc.

BobbyTables2 108 days ago

Loved the writeup!

Found the manual latent space exploration part really interesting.

Too many LLM/diffusion explanations fall into the proverbial “how to draw an owl” meme, without giving a taste of what’s going on.

plagiarist 108 days ago

I enjoyed this a lot.

The interpolations between butterfly and snail were pretty horrifying. But with something like Z-Image you could basically concatenate the text and end up with a normal image of both. Is the latent space for "butterfly and snail" just well off the path between the two individually?
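One rough way to check this, sketched under assumptions: if each prompt can be reduced to a single embedding vector (for example a pooled or flattened text-encoder output, which is an assumption about the model, not something stated on the page), you can measure how far the "butterfly and snail" vector lies from the straight line between the two single-word vectors. The helper name `distance_to_segment` is hypothetical, and the 2D vectors below are toy stand-ins, not real embeddings.

```python
import numpy as np


def distance_to_segment(a, b, c):
    """Return (t, distance) for point c relative to the segment a -> b.

    t is the position of the closest point along the segment (0 = a, 1 = b),
    and distance is how far c sits from that closest point.
    """
    a, b, c = (np.asarray(v, dtype=np.float64) for v in (a, b, c))
    ab = b - a
    denom = float(np.dot(ab, ab))
    if denom == 0.0:
        # a and b coincide, so the "segment" is a single point
        return 0.0, float(np.linalg.norm(c - a))
    t = float(np.dot(c - a, ab)) / denom
    t = min(1.0, max(0.0, t))          # clamp onto the segment
    closest = a + t * ab
    return t, float(np.linalg.norm(c - closest))


# Toy stand-ins for embed("butterfly"), embed("snail"),
# and embed("butterfly and snail")
t, dist = distance_to_segment([0.0, 0.0], [2.0, 0.0], [1.0, 1.0])
print(t, dist)  # t = 0.5, distance = 1.0
```

A distance that is large compared with the length of the segment would suggest the combined prompt sits well off the interpolation path, which would be consistent with the midpoint images looking nothing like "a butterfly and a snail".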

It's hard to imagine what is nearby in latent space and how text contributes, so I did really like the section adding words to the prompt 1-by-1.

adampunk 107 days ago

It's quite clever and thoughtful. Thanks for making it!
