by HanClinto
510 days ago
It's maybe still worth separating the tasks: use a traditional text detection model to find bounding boxes, then crop the image to each box. In a second stage, send those cropped samples to the higher-power LLMs to do the actual text extraction, and don't rely on them for bounding boxes at all. There are some vision LLMs that seem to be specifically trained to do bounding box detection (Moondream comes to mind as one that advertises this?), but in general I wouldn't be surprised if none of them work as well as traditional methods.
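A rough sketch of the two-stage split described above, in case it helps make the idea concrete. Everything here is an assumption for illustration: `Box`, `crop`, `two_stage_ocr`, and the two callbacks `detect_boxes` and `extract_text` are hypothetical names, the image is a plain list of pixel rows standing in for a real image type, and no particular detector or LLM API is implied.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple

# Stand-in image type: rows of grayscale pixel values (assumption for this sketch).
Image = List[List[int]]


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box; right and bottom are exclusive."""
    left: int
    top: int
    right: int
    bottom: int


def crop(image: Image, box: Box, pad: int = 0) -> Image:
    """Cut out the box, optionally padded, clamped to the image bounds."""
    height = len(image)
    width = len(image[0]) if height else 0
    top = max(0, box.top - pad)
    bottom = min(height, box.bottom + pad)
    left = max(0, box.left - pad)
    right = min(width, box.right + pad)
    return [row[left:right] for row in image[top:bottom]]


def two_stage_ocr(
    image: Image,
    detect_boxes: Callable[[Image], Iterable[Box]],  # stage 1: hypothetical detector wrapper
    extract_text: Callable[[Image], str],            # stage 2: hypothetical LLM wrapper
    pad: int = 0,
) -> List[Tuple[Box, str]]:
    """Detect boxes first, then read each crop separately.

    The text-extraction model only ever sees cropped regions, so it is
    never asked to produce coordinates.
    """
    results = []
    # Rough reading order: top to bottom, then left to right.
    for box in sorted(detect_boxes(image), key=lambda b: (b.top, b.left)):
        results.append((box, extract_text(crop(image, box, pad))))
    return results
```

In use, `detect_boxes` would wrap whatever traditional text detector you trust and `extract_text` would wrap the call to the LLM. Keeping both as plain callbacks means either stage can be swapped out and compared without touching the other.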