by ahzhou
504 days ago
LLMs are inherently bad at this because of tokenization, image scaling, and a lack of training on the task. Anthropic's computer use feature relies on a model trained specifically to count pixels:
> Training Claude to count pixels accurately was critical. Without this skill, the model finds it difficult to give mouse commands. [1]
For a VLM trained on identifying bounding boxes, check out PaliGemma [2]. You may also be able to get the computer use API to draw bounding boxes if the costs make sense. That said, I think the correct solution is likely to use a non-VLM to draw bounding boxes. Depends on the dataset and problem.

[1] https://www.anthropic.com/news/developing-computer-use
[2] https://huggingface.co/blog/paligemma
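
To make the scaling point concrete, here is a rough sketch of turning PaliGemma-style detection output into pixel boxes. It assumes the format as I understand it from the blog post [2]: four `<locNNNN>` tokens per box, in the order y_min, x_min, y_max, x_max, each quantized into 1024 bins, followed by a label. The `parse_boxes` helper and its regex are my own illustration, not part of any library.

```python
import re

# Assumed PaliGemma-style output: four <locNNNN> tokens per box
# (y_min, x_min, y_max, x_max, each a bin in 0..1023), then a label.
LOC_BOX = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)

def parse_boxes(text, width, height, bins=1024):
    """Return (x_min, y_min, x_max, y_max, label) tuples in pixel units."""
    boxes = []
    for m in LOC_BOX.finditer(text):
        # Normalize each bin index to the 0..1 range.
        y0, x0, y1, x1 = (int(g) / bins for g in m.groups()[:4])
        boxes.append((
            round(x0 * width), round(y0 * height),
            round(x1 * width), round(y1 * height),
            m.group(5).strip(),
        ))
    return boxes

if __name__ == "__main__":
    out = "<loc0256><loc0128><loc0512><loc0768> button"
    print(parse_boxes(out, width=1920, height=1080))
```

Under that assumption, a 1920-pixel-wide screenshot gets roughly 1.9 pixels per bin horizontally, so the quantization alone costs a pixel or two before any model error.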
PaliGemma seems to fit into a completely different niche right now (VQA and Segmentation) that I don't really see having practical applications for computer use.
[1] https://huggingface.co/microsoft/OmniParser?language=python
[2] https://github.com/browser-use/browser-use