| I had a somewhat similar experience trying to use LLMs to do OCR. All the models I've tried (Sonnet 3.5, GPT 4o, Llama 3.2, Qwen2 VL) have been pretty good at extracting text, but they failed miserably at finding bounding boxes, usually just making up random coordinates. I thought this might have been due to internal resizing of images so tried to get them to use relative % based coordinates, but no luck there either. Eventually gave up and went back to good old PP-OCR models (are these still state of the art? would love to try out some better ones). The actual extraction feels a bit less accurate than the best LLMs, but bounding box detection is pretty much spot on all the time, and it's literally several orders of magnitude more efficient in terms of memory and overall energy use. My conclusion was that current gen models still just aren't capable enough yet, but I can't help but feel like I might be missing something. How the heck did Anthropic and OpenAI manage to build computer use if their models can't give them accurate coordinates of objects in screenshots? |
You may also be able to get the computer use API to draw bounding boxes if the costs make sense.
That said, I think the correct solution is likely to use a non-VLM model to draw the bounding boxes, though that depends on the dataset and problem.
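If you do retry the relative-coordinate idea from the quoted comment, mapping the percentages back onto the original image is the part that sidesteps any internal resizing. Here is a minimal sketch of that step. The function name and the `(x_min, y_min, x_max, y_max)` percent format are my own assumptions for illustration, not something any particular model is documented to emit.

```python
from typing import NamedTuple, Tuple


class PixelBox(NamedTuple):
    left: int
    top: int
    right: int
    bottom: int


def percent_box_to_pixels(
    box_pct: Tuple[float, float, float, float],
    width: int,
    height: int,
) -> PixelBox:
    """Convert a box given in percent of image size into pixel coordinates.

    box_pct is (x_min, y_min, x_max, y_max), each in the range 0-100.
    This format is a hypothetical prompt convention, not a model API.
    width and height are the dimensions of the ORIGINAL image, so the
    result is independent of whatever resizing happened upstream.
    """
    # Clamp to 0-100 so hallucinated out-of-range values stay inside the image.
    x_min, y_min, x_max, y_max = (min(max(v, 0.0), 100.0) for v in box_pct)

    # Put the corners in order in case the model swapped them.
    x_min, x_max = sorted((x_min, x_max))
    y_min, y_max = sorted((y_min, y_max))

    return PixelBox(
        left=round(x_min * width / 100),
        top=round(y_min * height / 100),
        right=round(x_max * width / 100),
        bottom=round(y_max * height / 100),
    )


if __name__ == "__main__":
    # A box covering 10-50% horizontally and 20-60% vertically of a 1000x500 image.
    print(percent_box_to_pixels((10, 20, 50, 60), width=1000, height=500))
```

This only fixes the scaling problem. If the model is inventing coordinates in the first place, the converted boxes will be just as wrong, which is why a dedicated detector may still be the better option.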
1. https://www.anthropic.com/news/developing-computer-use
2. https://huggingface.co/blog/paligemma