

by pierre 635 days ago

Parsing docs using VLMs is the way forward (also see the OCR2 paper released last week; people are having a lot of success parsing with fine-tuned Qwen2).

The hard part is preventing the model from ignoring parts of the page and from hallucinating (see some of the GPT-4o samples here, like the Xanax notice: https://www.llamaindex.ai/blog/introducing-llamaparse-premiu...)

However, these models will get better, and we may soon have a good PDF-to-Markdown model.

3 comments

fzysingularity 635 days ago

We’ve been doing exactly this by doubling-down on VLMs (https://vlm.run)

- VLMs are way better at handling layout and context where OCR systems fail miserably

- VLMs read documents like humans do, which makes dealing with special layouts like bullets, tables, charts, and footnotes much more tractable with a single approach, rather than having to special-case a whole bunch of OCR + post-processing steps

- VLMs are definitely more expensive, but can be specialized and distilled for accurate and cost-effective inference

In general, I think vision + LLMs can be trained explicitly to “extract” information and avoid reasoning/hallucinating about the text. The reasoning can be another module altogether.

yigitkonur35 635 days ago

I did a ton of Googling before writing this code, but I couldn't find you guys anywhere. If I had, I'd have definitely used your stuff. You might want to think about running some small-scale Google Ads campaigns. They could be especially effective if you target people searching for both LLM and OCR together. Great product, congrats!

fzysingularity 635 days ago

Hey, thanks! DM me if you want to test it out (sudeep@vlm.run).

Agreed on SEO - we’re redoing our landing page and searchability. We recently rebranded, hence the lack of direct search hits for LLM / OCR.

authorfly 635 days ago

What about combining old school OCR with GPT visual OCR?

If your old-school OCR output contains text that is missing from the visual one but is coherent (e.g. English sentences), you could recover it and slot it into the corresponding gap in the visual output.
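
A rough sketch of what that merge could look like, assuming both outputs are already split into lines. The names `merge_ocr` and `looks_coherent`, the 0.8 thresholds, and the word-based coherence heuristic are all made up for illustration; a real pipeline would likely need a better coherence check and layout-aware positioning.

```python
import difflib
import re


def looks_coherent(line: str, min_words: int = 4) -> bool:
    """Crude coherence heuristic (an assumption): enough words, mostly alphabetic."""
    words = line.split()
    if len(words) < min_words:
        return False
    alpha = sum(
        1 for w in words if re.fullmatch(r"[A-Za-z][A-Za-z'\-]*[.,;:!?]?", w)
    )
    return alpha / len(words) >= 0.8


def merge_ocr(classic_lines, vlm_lines, threshold=0.8):
    """Slot coherent classic-OCR lines that the VLM output lacks into the VLM output."""
    merged = list(vlm_lines)
    insert_at = 0  # index in `merged` just after the last matched line
    for line in classic_lines:
        # Find the most similar line in the current merged output.
        best_idx, best_score = -1, 0.0
        for i, candidate in enumerate(merged):
            score = difflib.SequenceMatcher(
                None, line.lower(), candidate.lower()
            ).ratio()
            if score > best_score:
                best_idx, best_score = i, score
        if best_score >= threshold:
            # Line is already present; remember where we are in the document.
            insert_at = best_idx + 1
        elif looks_coherent(line):
            # Line is missing from the VLM output but reads like real text.
            merged.insert(insert_at, line)
            insert_at += 1
    return merged
```

Classic-OCR lines that look like noise are dropped rather than inserted, so the VLM output stays the primary source and the old-school OCR only fills gaps.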

yigitkonur35 635 days ago

You're absolutely right. I use PDFTron (through CloudConvert) for full-document OCR, but for pages with fewer than 100 characters, I switch to this API. It's a great combo: I get the solid OCR performance of SolidDocument for most content, but I can also handle tricky stuff like stats, old-fashioned text, or handwriting that regular OCR struggles with. That's why I added page numbers upfront.
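
The per-page routing described here could be sketched roughly as follows. This is not the actual code: `route_pages` and the `vlm_fallback` callable are hypothetical stand-ins (the latter for whatever vision API gets called), and only the 100-character threshold comes from the comment above.

```python
from typing import Callable, List, Tuple

MIN_CHARS = 100  # threshold taken from the comment above


def route_pages(
    ocr_pages: List[str],
    vlm_fallback: Callable[[int], str],
    min_chars: int = MIN_CHARS,
) -> List[Tuple[int, str, str]]:
    """Keep classic OCR text per page, falling back to a VLM for sparse pages.

    `vlm_fallback` is a hypothetical callable that takes a 1-based page
    number and returns the text extracted by a vision model.
    """
    results = []
    for page_no, text in enumerate(ocr_pages, start=1):
        if len(text.strip()) < min_chars:
            # Too little text: likely a scan, chart, or handwriting.
            results.append((page_no, "vlm", vlm_fallback(page_no)))
        else:
            results.append((page_no, "ocr", text))
    return results
```

Carrying the page number through each result makes it possible to send only the sparse pages to the more expensive model and stitch everything back in order afterwards.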

fkilaiwi 635 days ago

What paper are you referring to?

perrywky 634 days ago

I guess this: https://arxiv.org/html/2409.01704v1
