Hacker News new | ask | show | jobs
by gfiorav 477 days ago
I wonder what the speed of this approach vs traditional ocr techniques. Also, curious if this could be used for text detection (find a bounding box containing text within an image).
1 comments

Was just coming here to say this, there does not yet exist a multimodal vision LLM approach that is capable of identifying bounding boxes of where the text occurs. I suppose you could manually cut the image up and send each part separately to the LLM but that feels like an kludge and it's still in-exact.
We can do bounding boxes too :) we just call it visual grounding https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...
Kind of skeptical since you also provide a “confidence” value, which has to be entirely made up.

Do you have an example that isn’t a sample drivers license? Something that is unlikely to have appeared in an LLM’s training data?

Wait what? That's pretty neat. I'm on my phone right now, so I can't really view the notebook very easily. How does this work? Are you using some kind of continual partitioning of the image and refeeding that back into the LLM to sort of pseudo-zoom in/out on the parts that contain non-cut off text until you can resolve that into rough coordinates?
qwen 2.5 vl was specifically trained to produce bounding boxes I believe.