|
|
|
|
|
by pierre
635 days ago
|
|
Parsing docs using LVM is the way forward (also see OCR2 paper released last week, people are having ablot of success parsing with fine tunned Qwen2). The hard part is to prevent the model ignoring some part of the page and halucinations (see some of the gpt4o sample here like the xanax notice:https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...) However this model will get better and we may soon have a good pdf to md model. |
|
- VLMs are way better at handling layout and context where OCR systems fail miserably
- VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, footnotes much more tractable with a singular approach rather than have to special case a whole bunch of OCR + post-processing
- VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost effective inference
In general, I think vision + LLMs can be trained to explicitly to “extract” information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.