by stavros 348 days ago
If this were a limitation in the architecture, they wouldn't be able to work with images, no?

LLMs don’t work with images.
They do, though.
Do they? I thought it was completely different models that did image generation.

LLMs might be used to translate requests into keywords, but I didn’t think LLMs themselves did any of the image generation.

Am I wrong here?

Yes, that's why ChatGPT can look at an image and change the style, or edit things in the image. The image itself is converted to tokens and passed to the LLM.
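To make "converted to tokens" a bit more concrete, here is a minimal sketch of the patch-based approach commonly used in vision transformers: the image is cut into small tiles and each tile becomes one item in the sequence. This is a generic illustration, not a description of how ChatGPT or GPT-4o does it internally. The names `patchify` and `unpatchify` are made up for this example, and a real model would additionally pass each patch through a learned encoder to get an embedding.

```python
from typing import List

# Assumption for this sketch: a grayscale image stored as rows of 0-255 ints.
Image = List[List[int]]


def patchify(image: Image, patch: int) -> List[List[int]]:
    """Split an image into non-overlapping patch x patch tiles.

    Each tile is flattened row by row, so the result is a sequence of
    "image tokens" in reading order (left to right, top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [image[top + dy][left + dx]
                 for dy in range(patch)
                 for dx in range(patch)]
            )
    return tokens


def unpatchify(tokens: List[List[int]], width: int, patch: int) -> Image:
    """Rebuild the image from its patch sequence (inverse of patchify)."""
    per_row = width // patch
    height = (len(tokens) // per_row) * patch
    image = [[0] * width for _ in range(height)]
    for index, token in enumerate(tokens):
        top = (index // per_row) * patch
        left = (index % per_row) * patch
        for offset, value in enumerate(token):
            image[top + offset // patch][left + offset % patch] = value
    return image


# A tiny 4x4 "image" with pixel values 0..15.
image = [[row * 4 + col for col in range(4)] for row in range(4)]
tokens = patchify(image, patch=2)
restored = unpatchify(tokens, width=4, patch=2)
```

In this toy version the patch sequence still carries every pixel value, so `restored` is identical to `image`. A one-line text caption of the same image would not allow that.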
LLMs can be used as agents to do all sorts of clever things, but that doesn’t mean the LLM is actually handling the original data format.

I’ve created MCP servers that can scrape websites, but that doesn’t mean the LLM itself can make HTTP calls.

I make this distinction because someone claimed that LLMs can read images. But they don’t. They act as an agent for another model that reads images and creates metadata from them. The LLM then turns that metadata into natural language.

The LLM itself doesn’t see any pixels. It sees textual information that another model has provided.

Edit: reading more about this online, it seems LLMs can work with pixel-level data. I had no idea that was possible.

My apologies.

No problem. Again, if it happened the way you described (which it did, until GPT-4o recently), the LLM wouldn't have been able to edit images. You can't get a textual description of an image and reconstruct it perfectly just from that, with one part edited.