by stavros 348 days ago
Yes, that's why ChatGPT can look at an image and change the style, or edit things in the image. The image itself is converted to tokens and passed to the LLM.
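For anyone wondering what "converted to tokens" can look like in practice, here is a minimal sketch. It assumes a ViT-style patch scheme, which is one common approach; I don't know the exact mechanism ChatGPT uses, and `image_to_patch_tokens` is a made-up name for illustration. The image is cut into fixed-size tiles and each tile is flattened into a vector. In a real model, each vector would then pass through a learned projection into the same embedding space as the text tokens.

```python
from typing import List

Image = List[List[int]]  # grayscale image as rows of pixel intensities


def image_to_patch_tokens(image: Image, patch: int) -> List[List[int]]:
    """Split an image into non-overlapping patch x patch tiles, each flattened to a vector.

    Hypothetical helper: a real multimodal model would follow this with a
    learned linear projection, which is omitted here.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    tokens = []
    # Walk the tiles in row-major order, top-left to bottom-right.
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens


# A toy 4x4 "image" split into 2x2 patches gives 4 tokens of 4 values each.
image = [
    [0, 1, 2, 3],
    [4, 5, 6, 7],
    [8, 9, 10, 11],
    [12, 13, 14, 15],
]
tokens = image_to_patch_tokens(image, patch=2)
print(len(tokens))  # 4
print(tokens[0])    # [0, 1, 4, 5]
```

The point is that the pixel values themselves survive into the model's input, rather than a textual description of them.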

An LLM can be used as an agent to do all sorts of clever things, but that doesn’t mean the LLM is actually handling the original data format.

I’ve created MCP servers that can scrape websites but that doesn’t mean the LLM itself can make HTTP calls.

I make this distinction because someone claimed that LLMs can read images. But they don’t. They act as an agent for another model that reads the image and creates metadata from it. The LLM then turns that metadata into natural language.

The LLM itself doesn’t see any pixels. It sees textual information that another model has provided.

Edit: reading more about this online, it seems LLMs can work with pixel-level data. I had no idea that was possible.

My apologies.

No problem. Again, if it happened the way you described (which it did, until GPT-4o recently), the LLM wouldn't have been able to edit images. You can't get a textual description of an image and reconstruct it perfectly just from that, with one part edited.
We have been able to edit images since Stable Diffusion.