Hacker News new | ask | show | jobs
by EarlyOom 483 days ago
We can do bounding boxes too :) we just call it visual grounding https://github.com/vlm-run/vlmrun-cookbook/blob/main/noteboo...
2 comments

Kind of skeptical since you also provide a “confidence” value, which has to be entirely made up.

Do you have an example that isn’t a sample drivers license? Something that is unlikely to have appeared in an LLM’s training data?

Wait what? That's pretty neat. I'm on my phone right now, so I can't really view the notebook very easily. How does this work? Are you using some kind of continual partitioning of the image and refeeding that back into the LLM to sort of pseudo-zoom in/out on the parts that contain non-cut off text until you can resolve that into rough coordinates?