


	by erulabs 480 days ago
	You sort of have to use both: run OCR and an LLM, then correlate the two results. They are bad at very different things, but a subsequent call to a second LLM to pair the results improves quality significantly, and you get document understanding and context from the LLM as well as bounding boxes from the OCR. I'm building a "never fill out paperwork again" app; if anyone is interested, I'd be happy to chat!
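	To make the pairing idea concrete, here is a minimal sketch in Python. It is not the commenter's pipeline: the comment uses a second LLM call for the pairing step, while this sketch swaps in plain fuzzy string matching from the standard library so that it runs without any model. The names `OcrWord`, `locate`, and `pair`, the 0.8 threshold, and the hard-coded OCR words and LLM fields are all hypothetical stand-ins for real OCR and LLM output.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher


@dataclass
class OcrWord:
    text: str
    box: tuple[int, int, int, int]  # (x0, y0, x1, y1) in page pixels


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def locate(value: str, words: list[OcrWord], threshold: float = 0.8):
    """Return the merged box of the OCR word run that best matches `value`.

    Only runs with as many words as `value` has tokens are considered.
    Returns None when no run reaches `threshold` (an assumed cutoff).
    """
    n = max(1, len(value.split()))
    best_score, best_span = 0.0, None
    for start in range(len(words) - n + 1):
        span = words[start:start + n]
        score = similarity(value, " ".join(w.text for w in span))
        if score > best_score:
            best_score, best_span = score, span
    if best_span is None or best_score < threshold:
        return None
    x0s, y0s, x1s, y1s = zip(*(w.box for w in best_span))
    return (min(x0s), min(y0s), max(x1s), max(y1s))


def pair(fields: dict[str, str], words: list[OcrWord]) -> dict[str, dict]:
    """Attach a bounding box (or None) to each LLM-extracted field."""
    return {
        name: {"value": value, "box": locate(value, words)}
        for name, value in fields.items()
    }


# Hypothetical OCR output: note the misread "D0e".
words = [
    OcrWord("Name:", (10, 10, 60, 24)),
    OcrWord("Jane", (70, 10, 110, 24)),
    OcrWord("D0e", (115, 10, 150, 24)),
    OcrWord("Date:", (10, 40, 55, 54)),
    OcrWord("2024-01-15", (70, 40, 160, 54)),
]

# Hypothetical LLM output: one field ("ssn") does not appear on the page.
fields = {"name": "Jane Doe", "date": "2024-01-15", "ssn": "123-45-6789"}

result = pair(fields, words)
for name, info in result.items():
    print(name, info)
```

	In this toy input the LLM's "Jane Doe" still pairs with the OCR's misread "Jane D0e" and inherits its box, while the "ssn" field, which has no counterpart in the OCR words, comes back with no box. A field with no box is the kind of mismatch a second LLM call, or a human reviewer, would need to look at.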

3 comments

fzysingularity 480 days ago

We think VLMs will outperform most OCR+LLM solutions in due time. I get that there’s a need for these hybrid solutions today, but we’re comparing tech with 20+ years of maturity against something that’s roughly 1.5 years old.

Also, VLMs are end-to-end trainable, unlike OCR+LLM solutions (whose components are trained separately), so it’s clear that VLMs scale much better for domain-specific use cases or verticals.


cpursley 480 days ago

Any tips on how to prompt that second pairing step? And what sort of things do you ask the LLM to extract in step 1?


K0balt 480 days ago

A VLM that invokes OCR through tool use is a compelling idea that I would expect to produce pretty good results.
