by julius
50 days ago
Click coordinates. Agentic GUI work is really annoying when the multimodal agent cannot click on x,y coordinates. I tested Qwen3.6, Gemma4, and Nemotron3-nano-omni; they fully hallucinate x,y coords (did not try GLM-5V yet). GPT-5.5 can easily do it, and Vocaela, a tiny 500M model, is also quite good at it. Hope they improve the training for x,y clicking soon on the smallish multimodals.
Recently slopped an HTTP service together just so my local models can click, instead of relying on all the wild ways agents currently hack into the browser (browser-use, browser-harness, agent-browser, dev-browser, etc.): https://github.com/julius/vocaela-click-coords-http
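For anyone who wants to check whether a model's clicks are hallucinated, here is a rough sketch of a hit test in Python. Everything in it is an assumption for illustration: the `Box`, `to_pixels`, and `click_hits` names are made up, the 0..1000 normalized grid is just one convention some models use, and none of this is taken from the linked repo.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Hypothetical ground-truth target region, in pixel coordinates."""
    left: int
    top: int
    right: int
    bottom: int

    def contains(self, x: int, y: int) -> bool:
        # Edges are inclusive so a click exactly on the border counts.
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def to_pixels(x_norm: float, y_norm: float, width: int, height: int,
              grid: int = 1000) -> tuple[int, int]:
    """Map coordinates on an assumed 0..grid scale to pixel positions."""
    x = round(x_norm / grid * (width - 1))
    y = round(y_norm / grid * (height - 1))
    return x, y


def click_hits(pred: tuple[int, int], target: Box,
               width: int, height: int) -> bool:
    """True if the predicted click is on screen and inside the target box."""
    x, y = pred
    in_bounds = 0 <= x < width and 0 <= y < height
    return in_bounds and target.contains(x, y)
```

Running `click_hits` over a handful of screenshots with known button boxes gives a crude hit rate per model. If a model emits raw pixels instead of a normalized grid, skip `to_pixels` and pass its output straight in.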
Have you tried a two-step approach: review the image, then render a vector?