by fzysingularity
639 days ago
We’ve been doing exactly this by doubling down on VLMs (https://vlm.run):

- VLMs are much better at handling layout and context, where OCR systems fail miserably.
- VLMs read documents the way humans do, which makes special layouts like bullets, tables, charts, and footnotes much more tractable with a single approach, rather than having to special-case a whole bunch of OCR + post-processing.
- VLMs are definitely more expensive, but they can be specialized and distilled for accurate, cost-effective inference.

In general, I think vision + LLMs can be trained explicitly to “extract” information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.
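To make the extract-vs-reason split concrete, here is a rough Python sketch. Everything in it is made up for illustration (the `InvoiceFields` schema, the `vlm` callable, `fake_vlm`, and the threshold); it is not vlm.run's API or any particular model's interface. The idea is just that the extraction step only transcribes fields into a fixed schema, and any reasoning runs on those fields afterwards.

```python
import json
from dataclasses import dataclass, fields
from typing import Callable, Optional


@dataclass
class InvoiceFields:
    """Hypothetical extraction schema. Every field is optional so the
    model can return null instead of guessing a value."""
    vendor: Optional[str] = None
    total: Optional[float] = None
    due_date: Optional[str] = None


def extract(image_bytes: bytes, vlm: Callable[[bytes, str], str]) -> InvoiceFields:
    """Extraction module: ask the model to transcribe fields only.

    `vlm` is an assumed callable (image, prompt) -> JSON string.
    """
    names = [f.name for f in fields(InvoiceFields)]
    prompt = (
        "Return a JSON object with exactly these keys: "
        + ", ".join(names)
        + ". Copy values as printed; use null when a value is not visible."
    )
    raw = json.loads(vlm(image_bytes, prompt))
    # Keep only schema keys so stray model output cannot leak downstream.
    return InvoiceFields(**{k: raw.get(k) for k in names})


def is_large_invoice(doc: InvoiceFields, threshold: float = 1000.0) -> bool:
    """Reasoning module: works on extracted fields, never on the image."""
    return doc.total is not None and doc.total > threshold


def fake_vlm(image_bytes: bytes, prompt: str) -> str:
    """Stand-in for a real model call, returning one extra key on purpose."""
    return json.dumps(
        {"vendor": "Acme", "total": 1250.0, "due_date": None, "note": "ignored"}
    )


doc = extract(b"", fake_vlm)
large = is_large_invoice(doc)
```

The extra `note` key from the stand-in model is dropped by `extract`, and a missing `due_date` stays `None` rather than being filled in, which is the "extract, don't reason" behavior described above.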