by vunderba
478 days ago
Was just coming here to say this. There does not yet exist a multimodal vision LLM approach capable of identifying the bounding boxes of where the text occurs. I suppose you could manually cut the image up and send each part separately to the LLM, but that feels like a kludge and it's still inexact.
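
For what it's worth, the cut-it-up workaround could look roughly like the sketch below. It only computes the tile rectangles. The function name `tile_boxes` and the grid parameters are made up for illustration, and the actual LLM call is left out since it depends on whichever API you use. The tile a piece of text comes back from is then your (coarse) bounding box, which is why the result is only as precise as the grid.

```python
def tile_boxes(width, height, rows, cols):
    """Split a width x height image into a rows x cols grid.

    Returns a list of (left, top, right, bottom) pixel boxes in
    row-major order. Hypothetical helper, not part of any library.
    """
    boxes = []
    for r in range(rows):
        # Integer division keeps the tiles contiguous with no gaps.
        top = r * height // rows
        bottom = (r + 1) * height // rows
        for c in range(cols):
            left = c * width // cols
            right = (c + 1) * width // cols
            boxes.append((left, top, right, bottom))
    return boxes


if __name__ == "__main__":
    # Each box would be cropped out and sent to the model separately;
    # text found in a crop is located only to within that tile.
    for box in tile_boxes(100, 50, 2, 2):
        print(box)
```

Each tuple is in the (left, upper, right, lower) order that Pillow's `Image.crop` takes, if that is what you crop with. Text that straddles a tile edge gets split across crops, which is another reason this stays a kludge.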