| HN Mirror


by phire 216 days ago

Most image models are diffusion models, not LLMs, and have a bunch of other idiosyncrasies.

So I suspect it's more that lessons from diffusion image models don't carry over to text LLMs.

And the image models that are based on multimodal LLMs (like Nano Banana) seem to do a lot better at novel concepts.

1 comments

Gormo 213 days ago

But the clocks in this demo aren't images.


phire 213 days ago

Yes, but they are reasoning within their dataset, which will contain multiple examples of HTML+CSS clocks.

They are just struggling to produce good results because, as language models, they don't have great spatial reasoning skills.

Their output normally has all the elements, just not in the right place/shape/orientation.
