| HN Mirror


by embedding-shape 34 days ago

I guess they do "see" but more like "see an explanation of the image", not "see" as in experience visually. They're really bad at details and precision when it comes to images, and don't understand things like visual hierarchy, affordances and other fundamental design concepts. Most of them are able to describe those things in words, but don't seem to actually grasp them when asked to build UIs, even when you mention these things explicitly.

Try doing 100% vibe-coding with an agent and loosely specify what kind of application you want, and observe how the resulting UI and UX are a complete mess, unless you specify exactly how the UI and UX should work in practice.

If they actually had spatial understanding, together with being able to visually experience images, then they'd probably be able to build proper UI/UX from the get-go, but since they can only describe what those things are, you end up with the messes even the current SOTAs produce.


spongebobstoes 34 days ago

the models can accept images directly as tokens. not a description of an image, the actual image itself.

yes, the visual intelligence is limited, but they do actually have vision capabilities.


embedding-shape 33 days ago

Yes, I agree, we're saying the same thing, I'm just trying to highlight that the "visual intelligence" really isn't up to par for anything stringent when it comes to UI and UX. Explained further here: https://news.ycombinator.com/item?id=48133641


stingraycharles 34 days ago

> I guess they do "see" but more like "see an explanation of the image", not "see" as in experience visually.

Images are tokenized and fed to the exact same model, they can “visually inspect” images, eg “find the 2 differences between two images” and “where’s Waldo”-style things.

So your mental model that they see descriptions is inaccurate.


embedding-shape 33 days ago

> Images are tokenized

Exactly, and this is where the fidelity of the image gets lost. They don't "see" visually; they get a representation of the image via tokens, which is why I said they don't see but basically "see an explanation of the image". I don't mean a caption, but in the end they work with tokens internally, not pixels or actual images.
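To make "tokenized" a bit more concrete, here is a minimal sketch of the patch-splitting step used by ViT-style vision encoders. This is an assumption for illustration only: nobody in the thread says how Claude, Grok or ChatGPT actually tokenize images, and `patchify`, the 8x8 image and the patch size of 4 are all made up. Real encoders also project each patch to an embedding, which is left out here.

```python
def patchify(image, patch):
    """Split a 2D grid of pixel values into non-overlapping patch x patch tiles.

    Each tile is flattened row by row into one list. In a ViT-style encoder
    (an assumption, not confirmed for any model in this thread) each such
    list would then be projected to a single token embedding.
    """
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tile = [image[y][x]
                    for y in range(top, top + patch)
                    for x in range(left, left + patch)]
            tokens.append(tile)
    return tokens


# Hypothetical 8x8 grayscale image: all white (255) with one dark pixel.
image = [[255] * 8 for _ in range(8)]
image[5][2] = 0

tokens = patchify(image, patch=4)
print(len(tokens))     # number of patches
print(len(tokens[0]))  # pixel values per patch
```

The point of the sketch is only that the model's input unit is a patch rather than an individual pixel. Whether that is where the detail gets lost in practice is exactly what's being argued here.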

Example from Grok and Claude, with a very simple test case. I made a white image with 7 red dots and asked Claude and Grok to count the red dots. The filename is "8-red-dots.png" but the image actually only has 7 dots.

Because they don't actually receive the image itself, they receive "tokenized images" as you say, they don't seem to actually be able to see the number of red dots. ChatGPT correctly identified that there are only 7 dots, but only because it ended up using Python to actually count the pixels it seems.
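For reference, counting dots from pixels is a small script. I don't know what code ChatGPT actually ran, so this is just a rough sketch of one way to do it: flood-fill every connected group of red pixels and count the groups. The `is_red` thresholds, the function names and the synthetic 40x10 test image are all made up for illustration; for a real file you'd load the pixels with an imaging library first.

```python
from collections import deque


def is_red(pixel):
    """Rough 'is this pixel red' check; the thresholds are arbitrary assumptions."""
    r, g, b = pixel
    return r > 200 and g < 60 and b < 60


def count_red_blobs(image):
    """Count 4-connected groups of red pixels in a grid of (r, g, b) tuples."""
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    blobs = 0
    for y in range(height):
        for x in range(width):
            if seen[y][x] or not is_red(image[y][x]):
                continue
            # Unvisited red pixel: a new blob. Flood-fill to mark all of it.
            blobs += 1
            seen[y][x] = True
            queue = deque([(y, x)])
            while queue:
                cy, cx = queue.popleft()
                for ny, nx in ((cy + 1, cx), (cy - 1, cx), (cy, cx + 1), (cy, cx - 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and is_red(image[ny][nx])):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
    return blobs


# Synthetic stand-in for the test image: white background, seven 3x3 red squares.
WHITE, RED = (255, 255, 255), (255, 0, 0)
image = [[WHITE] * 40 for _ in range(10)]
for i in range(7):
    left = 2 + i * 5
    for y in range(4, 7):
        for x in range(left, left + 3):
            image[y][x] = RED

dot_count = count_red_blobs(image)
print(dot_count)  # 7
```

This works on exact pixel values, so it doesn't care what the filename claims.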

Original image + what the various LLMs responded: https://imgur.com/a/vh1tU6Y

Again, a very simple (and dumb) test, and I won't claim this is science, but once you start trying to use these vision models for precise and exact UI and UX work, you'll notice over and over how poor their fidelity and spatial awareness actually are when it comes to images.


semiquaver 31 days ago

You don’t “see” visually either! It’s just that when photons hit your rods and cones some electrical impulses go down your optic nerves and hit your visual cortex, and some math happens that your sensory systems interpret as vision. But it’s nothing of the sort, just a low-fidelity trick.


marcus_holmes 34 days ago

This is my experience too, but with all other aspects of the application. If you only loosely describe it, it comes out as a mess. You have to know what you're building to get the LLM to actually build something decent. I don't think this is purely a visual or design constraint.


embedding-shape 34 days ago

When I'm using agents for programming, I can have an AGENTS.md outlining exactly what requirements, guidelines and constraints all the code needs to follow, and the agent (codex in my case) will pretty much nail that.

I've tried doing the same for design work, just really outlining exactly how the UI and UX need to look and work, but for some reason it struggles a whole bunch with it, regardless of how clear I am. Maybe it's that I'm just worse at explaining and describing what UI and UX I'm actually after though, I suppose.


marcus_holmes 34 days ago

I once worked at a startup where the CEO was originally a designer. He once spent two days huddled with the main designer for the product, trying to pick exactly the right font for the product. I have no idea how you'd have that kind of discussion with an LLM.

But then, I would not spend more than five minutes on this decision, so I'm probably the wrong audience for this ;)


embedding-shape 33 days ago

I used to work at a designer-heavy company doing frontend work, where one of the founders could spot with the naked eye if you got the alignment of something wrong by 2-3 pixels during the reviews.

The UI and UX of the product were amazing, and it took some time to get used to actually delivering pixel-perfect designs across three different browsers, but fun times regardless :) Probably takes a certain individual to enjoy that sort of experience though.
