Hacker News
by jjcm 34 days ago
A lot of these things are made fast and loose, and unfortunately this is the reality of using the bleeding edge. Even Figma went through this kind of thing very early on.

To add something else to the discussion, however: I'd encourage people to skip Claude Design for another reason, namely the inherent limitations of LLMs for visual design. LLMs are blind, and spatial relationships are tremendously hard to reason about across layers of nested HTML/CSS.

If you're early on, I'd recommend starting with diffusion first. GPT-Image-2 is phenomenal at UI design and, especially if you're just starting out, it will let you align on a direction more rapidly than an LLM can. The difficulty will be converting from image->html, but you'll be able to explore different directions more cheaply and quickly than you could with Claude Design.

I will note a bias disclaimer here - I quit Figma to work on my own diffusion-based UI design tool. Not promoting that here, but wanted to at least share my findings in this space.

10 comments

What do you mean LLMs are blind? All frontier models are multimodal, which means they literally consume images as tokens. They can “see” exactly as well as they can “read”.

Also, GPT-Image-2 is not a diffusion model, it is based on Transformers, like other LLMs are.

I guess they do "see" but more like "see an explanation of the image", not "see" as in experience visually. They're really bad at details and precision when it comes to images, and they don't understand things like visual hierarchy, affordances and other fundamental design concepts. Most of them are able to describe those things in words, but they don't seem to fundamentally grasp them when asked to do UIs, even when you mention these things.

Try doing 100% vibe-coding with an agent and loosely specify what kind of application you want, and observe how the resulting UI and UX is a complete mess, unless you specify exactly how the UI and UX should work in practice.

If they actually had spatial understanding, together with being able to visually experience images, then they'd probably be able to build proper UI/UX from the get-go, but since they can only describe what those things are, you end up with the messes even the current SOTA models produce.

the models can accept images directly as tokens. not a description of an image, the actual image itself.

yes, the visual intelligence is limited, but they do actually have vision capabilities.

Yes, I agree, we're saying the same thing, I'm just trying to highlight that the "visual intelligence" really isn't up to par for anything stringent when it comes to UI and UX. Explained further here: https://news.ycombinator.com/item?id=48133641
> I guess they do "see" but more like "see an explanation of the image", not "see" as in experience visually.

Images are tokenized and fed to the exact same model, they can “visually inspect” images, eg “find the 2 differences between two images” and “where’s Waldo”-style things.

So your mental model that they see descriptions is inaccurate.

> Images are tokenized

Exactly, here is where the fidelity of an image is being lost, they don't "see" visually, they get a representation of the image via tokens, that's why I said they don't see but basically "see an explanation of the image". I don't mean like a caption, but in the end, they act and work with tokens, not pixels or actual images, internally.
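To make "tokenized images" concrete, here is a minimal sketch of patch-based tokenization in the ViT style, which is one publicly documented approach. It is an assumption that this resembles what any particular frontier model does; nothing in this thread establishes their actual tokenizers. The `patchify` helper and the tiny 8x8 image are made up for illustration. In a real model each flattened patch would then be projected to an embedding vector, which is omitted here.

```python
def patchify(img, patch=4):
    """Split a 2D grid of pixel values into non-overlapping patch x patch tiles.

    Each tile is flattened row-major into one list, i.e. one "token" worth of pixels.
    """
    height, width = len(img), len(img[0])
    if height % patch or width % patch:
        raise ValueError("image dimensions must be multiples of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([
                img[y][x]
                for y in range(top, top + patch)
                for x in range(left, left + patch)
            ])
    return patches


# Hypothetical 8x8 single-channel image with one "dot" pixel set.
img = [[0] * 8 for _ in range(8)]
img[1][1] = 1

patches = patchify(img, patch=4)
print(len(patches))     # 4 patches, i.e. 4 tokens for the whole image
print(len(patches[0]))  # 16 pixel values folded into each token
print(sum(patches[0]))  # 1: the dot is one value among 16 in its patch
```

The point of the sketch is only that a small detail ends up as a fraction of one token's input, which is one plausible reason fine detail is hard to recover.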

Example from Grok and Claude, with a very simple test case. I made a white image with 7 dots and asked Claude and Grok to count the red dots. The filename is "8-red-dots.png", but the image actually only has 7 dots.

Because they don't actually receive the image itself, they receive "tokenized images" as you say, they don't seem to actually be able to see the number of red dots. ChatGPT correctly identified that there are only 7 dots, but only because it ended up using Python to actually count the pixels it seems.

Original image + what the various LLMs responded: https://imgur.com/a/vh1tU6Y

Again, a very simple (and dumb) test, and I won't claim this is science, but once you start trying to use these vision models for precise and exact UI and UX work, you'll notice over and over how poor their fidelity and spatial awareness actually are when it comes to images.
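For reference, the "count the pixels with Python" route can be sketched roughly like this. This is not the script ChatGPT ran (that isn't shown anywhere in the thread); it is a minimal standard-library sketch that builds a synthetic white image with 7 red dots and counts connected red regions with a flood fill. The names `make_test_image`, `is_red` and `count_red_dots`, the colour thresholds, and the dot positions are all assumptions for illustration.

```python
from collections import deque

RED = (255, 0, 0)
WHITE = (255, 255, 255)


def make_test_image(width, height, centers, radius=3):
    """Build a white RGB grid with a filled red disc at each (x, y) center."""
    img = [[WHITE for _ in range(width)] for _ in range(height)]
    for cx, cy in centers:
        for y in range(max(0, cy - radius), min(height, cy + radius + 1)):
            for x in range(max(0, cx - radius), min(width, cx + radius + 1)):
                if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2:
                    img[y][x] = RED
    return img


def is_red(pixel):
    """Loose threshold so slightly off-red pixels still count (assumed values)."""
    r, g, b = pixel
    return r > 200 and g < 80 and b < 80


def count_red_dots(img):
    """Count 4-connected regions of red pixels using breadth-first flood fill."""
    height, width = len(img), len(img[0])
    seen = [[False] * width for _ in range(height)]
    dots = 0
    for y in range(height):
        for x in range(width):
            if seen[y][x] or not is_red(img[y][x]):
                continue
            dots += 1  # found an unvisited red pixel: a new dot
            seen[y][x] = True
            queue = deque([(x, y)])
            while queue:
                px, py = queue.popleft()
                for nx, ny in ((px + 1, py), (px - 1, py), (px, py + 1), (px, py - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and is_red(img[ny][nx])):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
    return dots


# Seven well-separated dots, mirroring the test case described above.
centers = [(10, 10), (30, 10), (50, 10), (70, 10), (20, 30), (40, 30), (60, 30)]
image = make_test_image(80, 40, centers)
print(count_red_dots(image))  # prints 7
```

With a real PNG you would need an image library to load the pixels first; that part is left out to keep the sketch dependency-free.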

You don’t “see” visually either! It’s just that when photons hit your rods and cones some electrical impulses go down your optic nerves and hit your visual cortex, and some math happens that your sensory systems interpret as vision. But it’s nothing of the sort, just a low-fidelity trick.
This is my experience too, but with all other aspects of the application. If you only loosely describe it, it comes out as a mess. You have to know what you're building to get the LLM to actually build something decent. I don't think this is purely a visual or design constraint.
When I'm using agents for programming, I can have an AGENTS.md outlining exactly what requirements, guidelines and constraints all the code needs to follow, and the agent (codex in my case) will pretty much nail that.

I've tried doing the same for design work, just really outlining exactly how the UI and UX need to look and work, but for some reason it struggles a whole bunch with it, regardless of how clear I am. Maybe it's just that I'm worse at explaining and describing what UI and UX I'm actually after, though, I suppose.

I once worked at a startup where the CEO was originally a designer. He once spent two days huddled with the main designer for the product, trying to pick exactly the right font for the product. I have no idea how you'd have that kind of discussion with an LLM.

But then, I would not spend more than five minutes on this decision, so I'm probably the wrong audience for this ;)

Used to work in a designer-heavy company doing frontend work; one of the founders could spot with the naked eye if you got the alignment of something wrong by 2-3 pixels during reviews.

The UI and UX of the product were amazing, and it took some time to get used to actually delivering pixel-perfect designs across three different browsers, but fun times regardless :) It probably takes a certain kind of individual to enjoy that sort of experience, though.

Tokens are not a substitute for a numerical measurement.

Ask an LLM how much time has passed. Watch it hallucinate wildly.

Has anyone noticed that Opus has trouble building ASCII diagrams (it often leaves out spaces, so lines are misaligned)?

LLMs are just one mechanical component. One might as well say "Ask your println how much time has passed". That is not a question that makes sense. As an example, I did not construct my agent specifically to answer your question and when I saw your question I queried the agent. And it is correct. https://imgur.com/a/j8j7hL9

As semiquaver said, modern LLMs are multi-modal, they can reason in image-space and audio-space as well as in text-space. It is not a translate then operate kind of situation. Claude Design is not a raw LLM, nor an instruction-tuned LLM. It is an agent harness around an LLM that allows it to do certain things.
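As a rough illustration of the "harness around an LLM" point applied to the elapsed-time question: the model itself has no clock, but a harness can expose one as a tool. The sketch below is hypothetical; `elapsed_seconds` is a made-up tool function, not part of Claude Design or any real agent framework, and treating naive timestamps as UTC is an assumption.

```python
from datetime import datetime, timezone


def elapsed_seconds(start_iso, now=None):
    """Hypothetical harness tool: seconds from an ISO-8601 timestamp until now."""
    start = datetime.fromisoformat(start_iso)
    if start.tzinfo is None:
        # Assumption: timestamps without an offset are treated as UTC.
        start = start.replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - start).total_seconds()


# A fixed "now" keeps the example deterministic.
fixed_now = datetime(2025, 1, 1, 12, 30, tzinfo=timezone.utc)
print(elapsed_seconds("2025-01-01T12:00:00+00:00", now=fixed_now))  # 1800.0
```

The measurement comes from the system clock; the model only has to decide to call the tool and report the number.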

Ok? Your comment is in no way responsive to anything I said.
> Also, GPT-Image-2 is not a diffusion model, it is based on Transformers, like other LLMs are.

Where are you getting this from btw? AFAIK, OpenAI hasn't openly talked about what exactly is powering the Images 2.0 stuff, unless I missed something? I think they've said it's not a diffusion model, but I'm not sure they've said what they're doing instead, have they?

I believe it's an evolution of the technique used in GPT-Image-1 (or whatever they called that), which was derived from their work on making GPT-4o an "omni" model that can directly output images and audio in addition to text.

The 2024 GPT-4o launch post https://openai.com/index/hello-gpt-4o/ hints about how that works:

"With GPT‑4o, we trained a single new model end-to-end across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network."

Yeah, that's my belief as well, but haven't seen any concrete explanations about how it works, just the marketing/press releases sadly.
Claude has been kicking ass at code, but I asked it to “sketch” a second floor with a stairway and bedrooms with large closets and it made … something that resembles something akin to not at all what I asked.
This has not been my experience. Claude artifacts at first, then Claude Design after it was released, are excellent at design! The way I can steer the model, updating the design with different ideas and visions, even adopting different design systems like Material 3 or Apple’s HIG, has been phenomenal.
It's also by far the best in my experience at a request like "it's 3:55 and I need a few slides on the topic of the Gettysburg Address for a 4PM meeting."

I wish it was more integrated into PowerPoint but it's still the best slide generator I've used.

I found gpt5.5 great at that too
Thank you so much for your suggestion regarding UI design. As this is not my main expertise, I need some tool to depend on to ground my projects somehow. Even though Stitch by Google and Claude Design are not perfect, they give me a starting point. Then, after building the actual working project, I iterate until I like the look of it. This is how I'm using these right now. I can't even iterate on these design LLMs now; their own UX is very clunky and not very friendly, or it's made more for the design folks.

But I will give GPT-Image-2 a try. Actually, a few months back I remember doing this UX/UI research in the ChatGPT app itself, just asking it to generate what a certain app might look like and so on.

Please let me know your UI design tool. I want to try it out.

> A lot of these things are made fast and loose

No kidding - you can't even delete a design system, draft or otherwise. Research Preview is accurate, it can do some things (but every system I've tried building it has resorted to the "hero text with key word in a different color" trope, however I try different prompts), but there's a lot missing (and when you ask Claude Design how to delete a design system it gives you an absolutely inaccurate and hallucinated answer and you say fine, here's the project ID, do it for me, "Sorry, can't, only you can").

Or just use Google's Stitch, it integrates both code via Gemini and image UI generation via Nano Banana which I'd argue is even better than OpenAI's image models.
It's really not; gpt-image-2 is #1 by over 100 Elo.
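For a sense of scale, under the standard Elo formula a 100-point gap corresponds to the higher-rated entry being preferred in roughly 64% of head-to-head comparisons. This is a sketch assuming the classic logistic formula with a scale of 400; the leaderboard being referred to isn't named here and may compute its ratings differently.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected head-to-head score for the higher-rated side (standard Elo formula)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


print(round(elo_expected_score(100), 2))  # 0.64
print(elo_expected_score(0))              # 0.5, equal ratings are a coin flip
```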
What's the source of that, are there image benchmarks?
> A lot of these things are made fast and loose

Yeah, I'm starting to be worried about Anthropic's security controls for customer information.

To say they'd have a firehose of sensitive info from customers would be a massive understatement. Hackers gaining access to that, especially for a non-trivial duration, would be a disaster.

If you say the image models don't "see" you also have to say the text models don't "read": there's a meaningful case to be made for either claim but then you're left saying "they behave as if they see" or "they behave as if they read".
Multimodal LLMs are not blind.

Claude Design, in my experience, is very, very solid.

I’ve only used it for fairly basic stuff, things that are very well represented in the training data. But for that it has made me happy.
Huh, I never thought of asking an image model to prototype a UI. It's a good idea though, I will try it next time.
> A lot of these things are made fast and loose, and unfortunately this is the reality of using the bleeding edge.

Anthropic lazily calls everything a preview and then pushes it hard on everyone. That feels dishonest.