| HN Mirror



	by vunderba 478 days ago
	Was just coming here to say this. There does not yet exist a multimodal vision LLM approach that can identify the bounding boxes of where text occurs. I suppose you could manually cut the image up and send each part separately to the LLM, but that feels like a kludge and it's still inexact.
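A minimal sketch of the "cut the image up" workaround mentioned above, assuming a fixed grid and no particular LLM API. The function name `tile_boxes` and the grid parameters are hypothetical. It only computes the pixel box of each tile, so whatever the model reports for a tile can be mapped back to full-image coordinates; the precision is limited to the tile size, which is why the result is inexact.

```python
from typing import Iterator, Tuple

# (left, top, right, bottom) in pixels, right/bottom exclusive
Box = Tuple[int, int, int, int]


def tile_boxes(width: int, height: int, rows: int, cols: int) -> Iterator[Box]:
    """Yield the pixel box of each tile in a rows x cols grid, row by row."""
    for r in range(rows):
        for c in range(cols):
            # Integer division keeps tiles contiguous even when the
            # image size is not an exact multiple of the grid size.
            left = c * width // cols
            right = (c + 1) * width // cols
            top = r * height // rows
            bottom = (r + 1) * height // rows
            yield (left, top, right, bottom)
```

Each box could then be used to crop the image with any imaging library before sending the crop to the model.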

2 comments

EarlyOom 478 days ago

We can do bounding boxes too :) we just call it visual grounding https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...


what 478 days ago

Kind of skeptical since you also provide a “confidence” value, which has to be entirely made up.

Do you have an example that isn’t a sample driver’s license? Something that is unlikely to have appeared in an LLM’s training data?


vunderba 478 days ago

Wait, what? That's pretty neat. I'm on my phone right now, so I can't really view the notebook very easily. How does this work? Are you using some kind of continual partitioning of the image and re-feeding that back into the LLM to sort of pseudo-zoom in/out on the parts that contain non-cut-off text until you can resolve that into rough coordinates?
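To make the guess above concrete, here is a rough sketch of that kind of repeated partitioning, written as a quadtree-style search. It is not claimed to be how the linked notebook works; the thread does not say. `locate_text`, `has_text`, and `min_size` are hypothetical names, and `has_text` stands in for a call that crops the region and asks the model whether it contains text.

```python
from typing import Callable, List, Tuple

# (left, top, right, bottom) in pixels, right/bottom exclusive
Box = Tuple[int, int, int, int]


def locate_text(box: Box, has_text: Callable[[Box], bool], min_size: int = 32) -> List[Box]:
    """Recursively split `box` into quadrants, keeping only regions where
    `has_text` (a stand-in for the model call) answers True."""
    left, top, right, bottom = box
    if not has_text(box):
        return []
    # Stop zooming once the region is small enough to serve as a rough box.
    if right - left <= min_size and bottom - top <= min_size:
        return [box]
    mid_x = (left + right) // 2
    mid_y = (top + bottom) // 2
    quadrants = [
        (left, top, mid_x, mid_y),
        (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom),
        (mid_x, mid_y, right, bottom),
    ]
    found: List[Box] = []
    for q in quadrants:
        # Skip degenerate quadrants produced by splitting a 1-pixel side.
        if q[2] > q[0] and q[3] > q[1]:
            found.extend(locate_text(q, has_text, min_size))
    return found
```

The cost is one model call per visited region, and the output is a set of coarse cells rather than a tight box, so this shows the shape of the idea more than a practical method.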


chpatrick 478 days ago

Qwen 2.5 VL was specifically trained to produce bounding boxes, I believe.
