Hacker News new | ask | show | jobs
by losvedir 34 days ago
The question was whether you were giving it the rendered image and using the model's visual modal capability, or feeding back in the textual SVG.

It's hard to "imagine" what the rendered SVG looks like, for both humans and LLMs, so just iterating on text won't really be as useful of a test. But if you show it what it rendered, it might observe the bad-looking bicycle and be able to fix the text that way.

1 comments

"I've even experimented with feeding the broken pelican svgs to an image model to look for flaws, and they still fail to spot the broken elements."