Hacker News
by hnlmorg 348 days ago
LLMs can be used as agents to do all sorts of clever things, but that doesn’t mean the LLM itself is handling the original data format.

I’ve created MCP servers that can scrape websites, but that doesn’t mean the LLM itself can make HTTP calls.

I make this distinction because someone claimed that LLMs can read images. But they don’t. They act as an agent for another model that reads the image and generates metadata from it. The LLM then turns that metadata into natural language.

The LLM itself doesn’t see any pixels. It sees textual information that another model has provided.

Edit: after reading more about this online, it seems LLMs can work with pixel-level data. I had no idea that was possible.

My apologies.


No problem. Again, if it worked the way you described (which it did until GPT-4o recently), the LLM wouldn't be able to edit images. You can't take a textual description of an image and reconstruct the original exactly from it, with only one part changed.
We have been able to edit images since Stable Diffusion.