I wonder how the speed of this approach compares with traditional OCR techniques. Also, I'm curious whether this could be used for text detection (finding a bounding box that contains text within an image).
I was just coming here to say this: there does not yet exist a multimodal vision LLM approach that can identify the bounding boxes where text occurs. I suppose you could manually cut the image up and send each part separately to the LLM, but that feels like a kludge and it's still inexact.
Wait, what? That's pretty neat. I'm on my phone right now, so I can't really view the notebook very easily. How does this work? Are you using some kind of continual partitioning of the image and refeeding that back into the LLM to sort of pseudo-zoom in/out on the parts that contain text that isn't cut off, until you can resolve that into rough coordinates?
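For anyone else trying to picture the partition-and-refeed idea, here is a rough sketch of how it might look. To be clear, this is a guess at the general technique, not what the notebook actually does. `has_text` is a hypothetical stand-in for "crop this region and ask the vision LLM whether it contains text", and `grid_oracle` is a made-up fake of that call so the sketch runs without any model. The names `locate_text` and `union` and the `min_size` parameter are assumptions too.

```python
from typing import Callable, List, Optional, Tuple

# (left, top, right, bottom); right and bottom are exclusive.
Box = Tuple[int, int, int, int]


def locate_text(box: Box, has_text: Callable[[Box], bool], min_size: int = 2) -> List[Box]:
    """Split `box` into quadrants recursively, keeping regions where `has_text` is true."""
    min_size = max(min_size, 1)  # a size of 0 would never terminate
    left, top, right, bottom = box
    if not has_text(box):
        return []
    # Stop once the region is too small to be worth splitting again.
    if right - left <= min_size and bottom - top <= min_size:
        return [box]
    mid_x = (left + right) // 2
    mid_y = (top + bottom) // 2
    quadrants = [
        (left, top, mid_x, mid_y),
        (mid_x, top, right, mid_y),
        (left, mid_y, mid_x, bottom),
        (mid_x, mid_y, right, bottom),
    ]
    hits: List[Box] = []
    for quad in quadrants:
        # Skip zero-width or zero-height quadrants from odd-sized splits.
        if quad[2] > quad[0] and quad[3] > quad[1]:
            hits.extend(locate_text(quad, has_text, min_size))
    return hits


def union(boxes: List[Box]) -> Optional[Box]:
    """Smallest box covering every box in `boxes`, or None if the list is empty."""
    if not boxes:
        return None
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def grid_oracle(grid: List[List[int]]) -> Callable[[Box], bool]:
    """Fake of the LLM call: a region 'has text' if any cell inside it is nonzero."""
    def has_text(box: Box) -> bool:
        left, top, right, bottom = box
        return any(grid[y][x] for y in range(top, bottom) for x in range(left, right))
    return has_text
```

Each level of recursion costs up to four model calls per surviving region, so this would be slow and the result is only as fine as `min_size` allows, which fits the "rough coordinates" framing. A real model would also give noisy yes/no answers, which this sketch ignores.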