HN Mirror


by julius 50 days ago

Click coordinates. Agentic GUI work is really annoying when the multi-modal agent cannot click on x,y coordinates.

I tested Qwen3.6, Gemma4, Nemotron3-nano-omni. They fully hallucinate x,y coords. (did not try GLM-5V yet)

GPT-5.5 can easily do it. But also Vocaela, a tiny 500M model, is quite good at it. Hope they improve the training for x,y clicking soon on the smallish multi-modals.

Recently slopped an HTTP service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.): https://github.com/julius/vocaela-click-coords-http

3 comments

cyanydeez 50 days ago

This sounds a lot like another Hacker News post from the last few days. It's the same problem image generators have with a prompt like "produce the numbers 1-50 in a spiral pattern": they can't count properly. But if you break it into a raster step and a vector step, where the model first produces the visual content and then an SVG overlay, it's completely capable.

Have you tried a two-step approach: review the image, then render a vector?


julius 50 days ago

Maybe there is a smart trick to get them to do the right thing, but the things I tried did not work.

At one point I had a smaller model draw bounding boxes around everything that looked interactable, with labels like "e3", and then asked the main model to answer with something like "click on e3". That did not work in my tests; it was pretty much as bad as x,y.
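For anyone who wants to try the same labeling idea, here is a minimal sketch of the bookkeeping side only. It assumes the boxes arrive as `(x1, y1, x2, y2)` pixel tuples from some detector and that the model replies with free text containing a label. The names `label_boxes` and `click_target` are made up for illustration and are not from any of the tools mentioned in this thread.

```python
import re

def label_boxes(boxes):
    """Assign labels e1, e2, ... to boxes given as (x1, y1, x2, y2) pixel tuples."""
    return {f"e{i}": box for i, box in enumerate(boxes, start=1)}

def click_target(reply, labeled):
    """Return the center of the first labeled box named in the model's reply.

    Returns None when the reply names no label or names an unknown one,
    so the caller can retry instead of clicking somewhere random.
    """
    match = re.search(r"\be\d+\b", reply)
    if match is None or match.group(0) not in labeled:
        return None
    x1, y1, x2, y2 = labeled[match.group(0)]
    return ((x1 + x2) // 2, (y1 + y2) // 2)

# Hypothetical detector output for three interactable elements.
labeled = label_boxes([(0, 0, 10, 10), (20, 20, 60, 40), (100, 50, 200, 90)])
print(click_target("click on e3", labeled))
```

This only turns a chosen label into a click point. Whether the model picks the right label is a separate problem, and that is the part that failed in the tests described above.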


cyanydeez 50 days ago

Yeah, I've held off on doing any kind of RAG until there are models that properly handle layout detection and partitioning, because it's so easy to generate shitty data if you don't attend to the visual cues before you slice up a document.


lopuhin 50 days ago

Qwen3.5 is able to output click coordinates and bounding boxes just fine, as values normalized to 0..1000. I'd hope Qwen3.6 didn't lose this capability.
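If the coordinates come back on a 0..1000 grid, they need to be scaled to the screenshot size before clicking. A minimal sketch, assuming the model returns `x` and `y` on that grid and that the screenshot's pixel width and height are known. The function name and the clamping behavior are illustrative choices, not part of any model's API.

```python
def denormalize_point(x_norm, y_norm, width, height, scale=1000):
    """Map a point from a 0..scale grid to pixel coordinates in a width x height image."""
    if not (0 <= x_norm <= scale and 0 <= y_norm <= scale):
        raise ValueError(f"expected values in 0..{scale}, got ({x_norm}, {y_norm})")
    # Clamp so that a value of exactly `scale` still lands on the last pixel.
    x = min(round(x_norm / scale * width), width - 1)
    y = min(round(y_norm / scale * height), height - 1)
    return x, y

# Example: the center of the grid on a 1920x1080 screenshot.
print(denormalize_point(500, 500, 1920, 1080))  # (960, 540)
```

A coordinate that looks hallucinated is sometimes just a scale mismatch, for example treating grid values as raw pixels on a screenshot that was resized before being sent to the model, so it is worth ruling that out first.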


withinrafael 50 days ago

I've had lots of success with generating coordinates and answering questions using the UI-TARS model https://github.com/bytedance/UI-TARS.


theturtletalks 50 days ago

I'd also check out midscene. You can set the model: UI-TARS works, but Qwen vision models work too.
