| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by troysk 719 days ago
	In my experience, this works well but doesn't scale to all kinds of documents. For scientific papers; it can't render formulas. meta's nougat is the best model to do that. For invoices and records; donut works better. Both these models will fail in some cases so you end up running LLM to fix the issues. Even with that LLM won't be able to do tables and charts justice, as the details were lost during OCR process (bold/italic/other nuances). I feel these might also be "classical" methods. I have found vision models to be much better as they have the original document/image. Having prompts which are clear helps but still you won't get 100% results as they tend to venture off on their paths. I believe that can be fixed using fine tuning but no good vision model provides fine tuning for images. Google Gemini seems to have the feature but I haven't tried it. Few shots prompting helps keep the LLM from hallucinating, prompt injection and helps adhering to the format requested.

5 comments

jszymborski 719 days ago

Maybe a pipeline like:

1. Segment document: Identify which part of the document is text, what is an image, what is a formula, what is a table, etc...

2. For text, do OCR + LLM. You can use LLMs to calculate the expectation of the predicted text, and if it is super off, try using ViT or something to OCR.

3. For tables, you can get a ViT/CNN to identify the cells to recover positional information, and then OCR + LLM for recovering the contents of cells

4. For formulas (and formulas in tables), just use a ViT/CNN.

5. For images, you can get a captioning ViT/CNN to caption the photo, if that's desired.

link

ozim 719 days ago

I don't see how you make LLM improve tables where most of the time table is single word or single value that doesn't have continuous context like a sentence.

link

jszymborski 719 days ago

IMHO, the LLM correction is most relevant/useful in the edge cases rather than the modal ones, so I totally agree.

link

refulgentis 719 days ago

They take images

link

troysk 719 days ago

How to segment the document without LLM?

I prefer to do all of this in 1 step with an LLM with a good prompt and few shots.

With so many passes with images, the costs/time will be high with ViT being slower.

link

jszymborski 719 days ago

Segmenting can likely be done on a really small resolution and with a CNN, making it real short.

There are some heuristic ways of doing it but i doubt you'll be able to distinguish equations from text.

link

troysk 717 days ago

Segmenting at lower resolution and then using them at higher resolution using resolution multipliers don't work as other items bleed in. FastSAM paper has some interesting ideas on doing this with CNNs which I guess SAM2 have superseded. However, the complication in the pipeline is not worth the result as I find vision LLMs are able to do almost the same task within the same OCR prompt.

link

wahnfrieden 719 days ago

Apple APIs such as Live Text, subject identification, Vision. Run them on a server, too

link

vintermann 719 days ago

I agree that vision models that actually have access to the image are a more sound approach than using OCR and trying to fix it up. It may be more expensive though, and depending on what you're trying to do it may be good enough.

What I want to do is reading handwritten documents from the 18th century, and I feel like the multistep approach hits a hard ceiling there. Transkribus is multistep, but the line detecion model is just terrible. Things that should be easy, such as printed schemas, utterly confuse it. You simply need to be smart about context to a much higher degree than you need in OCR of typewriter-written text.

link

huijzer 719 days ago

I also think it’s probably more effective. Every time hand-crafted tools are better than AI but then the model becomes bigger and AI wins. Think hand crafted image classification to full model or hand crafted language translation to full model.

In this case, the model can already do the OCR and becomes an order of magnitude cheaper per year.

link

kiratp 718 days ago

The Bitter Lesson -> http://www.incompleteideas.net/IncIdeas/BitterLesson.html

link

troysk 717 days ago

both openai and claude vision models are able do that for me. It is more expensive than tesseract which can run on cpu but I assume it will become similarly cheap in the near future with open models and as AI becomes ubiquitous.

link

ChadNauseam 719 days ago

It's not OSS, but I've had good experiences with using MathPix's API for OCR for formulas

link

troysk 719 days ago

nougat, donut are OSS. There are no OSS vision models but we will soon have them. MathPix API are also not OSS and I found them expensive compared to vision models.

Mathpix Markdown however is awesome and I ask LLMs to use that to denote formulas as latex is tricky to render in HTML because of things not matching. I don't know latex well so haven't gone deeper on it.

link

EarlyOom 719 days ago

We've been trying to solve this with https://vlm.run: the idea is to combine the character level accuracy of an OCR pipeline (like Tesseract) with the flexibility of a VLM. OCR pipelines struggle with non-trivial text layouts and don't have any notion of document structure, which means there needs to be another layer on top to actually extract text content to the right place. At the other end of the spectrum, VLMs (like GPT4o) tend to perform poorly on things like dense tables (either hallucinating or giving up entirely) and complex forms, in addition to being much slower/more expensive. Part of the fix is to allow a 'manager' VLM to dispatch to OCR on dense, simple documents, while running charts, graphs etc. through the more expensive VLM pipeline.

link

troysk 719 days ago

Maybe you could try extracting the text also using some pdf text extraction and use that also to compare. Might help fix numbers which tesseract gets wrong sometimes.

link