HN Mirror

	by EarlyOom 483 days ago
	We can do bounding boxes too :) we just call it visual grounding https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...

2 comments

what 482 days ago

Kind of skeptical since you also provide a “confidence” value, which has to be entirely made up.

Do you have an example that isn’t a sample driver’s license? Something that is unlikely to have appeared in an LLM’s training data?

vunderba 483 days ago

Wait, what? That's pretty neat. I'm on my phone right now, so I can't really view the notebook very easily. How does this work? Are you repeatedly partitioning the image and feeding the crops back into the LLM to sort of pseudo-zoom in and out on the parts that contain text that isn't cut off, until you can resolve that into rough coordinates?
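
For what it's worth, the guess in this comment (keep splitting the image and re-querying the model on the crops) can be sketched as a quadtree search. This is only an illustration of that guess, not a description of how the linked notebook actually does visual grounding. `contains_target` is a hypothetical stand-in for "crop this region and ask the model whether the text is in it", and `make_fake_oracle` fakes that answer with a known rectangle so the sketch runs without any model.

```python
from typing import Callable, List, Optional, Tuple

# (x0, y0, x1, y1) in pixels, half-open on the right/bottom edges.
Box = Tuple[int, int, int, int]


def split_quadrants(box: Box) -> List[Box]:
    """Split a box into four quadrants around its midpoint."""
    x0, y0, x1, y1 = box
    mx, my = (x0 + x1) // 2, (y0 + y1) // 2
    return [(x0, y0, mx, my), (mx, y0, x1, my),
            (x0, my, mx, y1), (mx, my, x1, y1)]


def localize(box: Box,
             contains_target: Callable[[Box], bool],
             min_size: int = 16) -> List[Box]:
    """Return the smallest visited regions that the oracle says contain the target.

    `contains_target` is an assumption: in a real system it would crop the
    image to `box` and query a vision-language model.
    """
    if not contains_target(box):
        return []  # prune this whole region
    x0, y0, x1, y1 = box
    if (x1 - x0) <= min_size or (y1 - y0) <= min_size:
        return [box]  # small enough, stop zooming
    hits: List[Box] = []
    for quad in split_quadrants(box):
        hits.extend(localize(quad, contains_target, min_size))
    return hits


def bounding_box(boxes: List[Box]) -> Optional[Box]:
    """Merge the surviving regions into one rough bounding box."""
    if not boxes:
        return None
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def make_fake_oracle(target: Box) -> Callable[[Box], bool]:
    """Hypothetical oracle: answers 'yes' when a region overlaps `target`."""
    tx0, ty0, tx1, ty1 = target

    def oracle(box: Box) -> bool:
        x0, y0, x1, y1 = box
        return x0 < tx1 and tx0 < x1 and y0 < ty1 and ty0 < y1

    return oracle


# Example: a 256x256 "image" with text occupying (100, 40)-(180, 72).
oracle = make_fake_oracle((100, 40, 180, 72))
rough = bounding_box(localize((0, 0, 256, 256), oracle, min_size=16))
print(rough)
```

The result can only be as tight as `min_size`, and every visited region costs one model call, so a scheme like this trades precision against the number of queries. A real model's yes/no answers would also be noisy, which this fake oracle does not capture.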