Hacker News
by emregucerr 495 days ago
i wonder how good R1 is at counting pixels from a screenshot. what enabled claude and OAI's CUA to develop computer use was being able to give the precise x-y coordinates of a click location.

also, how big of a gain is reasoning for computer use? i feel like reasoning unlocks a lot when there is a single complex question, but it doesn't help as much when taking actions as part of a long-term plan.

1 comment

Yep, coordinate grounding is key. We use Ai2's pixmo for a lot of that: https://huggingface.co/datasets/allenai/pixmo-points

We had previously created https://huggingface.co/datasets/agentsea/wave-ui, but pixmo has superseded it, since pixmo contains over a million data points.
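
For anyone wondering what coordinate grounding comes down to at click time, here is a minimal sketch in Python. It assumes the model (or the dataset label) gives a point as x/y percentages of the image size on a 0 to 100 scale, which is my understanding of how pixmo-points stores points but is worth checking against the dataset card. The names `Point`, `to_pixels`, and `hits_target` are hypothetical helpers for illustration, not part of any library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    """A point on a normalized grid (ASSUMPTION: 0..100 percent of image size)."""
    x: float
    y: float


def to_pixels(point: Point, width: int, height: int, scale: float = 100.0) -> tuple[int, int]:
    """Map a normalized point to integer pixel coordinates on a width x height screenshot."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    px = int(point.x / scale * width)
    py = int(point.y / scale * height)
    # Clamp so a point on the right or bottom edge still lands inside the image.
    px = min(max(px, 0), width - 1)
    py = min(max(py, 0), height - 1)
    return px, py


def hits_target(click: tuple[int, int], box: tuple[int, int, int, int]) -> bool:
    """Check whether a pixel click lands inside a (left, top, right, bottom) box."""
    left, top, right, bottom = box
    x, y = click
    return left <= x <= right and top <= y <= bottom


if __name__ == "__main__":
    click = to_pixels(Point(50.0, 50.0), width=1920, height=1080)
    print(click)
    print(hits_target(click, (900, 500, 1000, 600)))
```

The box check is one simple way to score a predicted click against a labeled UI element. A small button leaves only a few pixels of slack, which is why precise coordinates matter so much here.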