|
|
|
|
|
by semiquaver
37 days ago
|
|
What do you mean LLMs are blind? All frontier models are multimodal, which means they literally consume images as tokens. They can “see” exactly as well as they can “read”. Also, GPT-Image-2 is not a diffusion model, it is based on Transformers, like other LLMs are. |
|
Try doing 100% vibe-coding with an agent and loosely specify what kind of application you want, and observe how the resulting UI and UX is a complete mess, unless you specify exactly how the UI and UX should work in practice.
If they actually had spatial understanding, together with being able to visually experience images, then they'd probably be able to build proper UI/UX from the get go, but since they only could describe what those things are, you end up with the messes even the current SOTAs produce.