

by anon373839 480 days ago

> But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.

That's not true. LLMs and OCR have very different failure modes. With LLMs, there is unbounded potential for hallucination, and the entire document is at risk. For example: if something in the lower right-hand corner of the page takes the model to a sparsely sampled part of the latent space, it can end up deciding that it makes sense to rewrite the document title! Or anything else. LLMs also have a pernicious habit of "helpfully" completing partial sentences that appear at the beginning or end of a page of text.

With OCR, errors are localized and have a greater chance of being detected when read.

I think for a lot of cases, the best solution is to fine-tune a model like LayoutLM, which can classify the actual text tokens in a document (whether obtained from OCR or a native text layer) using visual and spatial information. Because the model only labels tokens that are already there rather than generating text, there are no hallucinations, and you can use uncertainty information from both the OCR (if used) and the text classification. But it does mean that you have to do the work of annotating data and training a model, rather than prompt engineering...
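To make the "use uncertainty from both stages" point concrete, here is a minimal sketch. It assumes you already have a per-word OCR confidence and a per-token label distribution from the classifier; the `Token` class, the `review_queue` helper, the label names, and the 0.8 threshold are all made up for illustration and are not part of LayoutLM or any OCR engine's API.

```python
from dataclasses import dataclass


@dataclass
class Token:
    text: str
    ocr_conf: float                 # OCR confidence for this word, in [0, 1]
    label_probs: dict[str, float]   # classifier probabilities over labels


def review_queue(tokens, threshold=0.8):
    """Return (text, label, combined) for tokens that need human review."""
    flagged = []
    for tok in tokens:
        # Most likely label and its probability.
        label, p = max(tok.label_probs.items(), key=lambda kv: kv[1])
        # Multiplying treats the two scores as independent, which is an
        # assumption made here for simplicity, not a calibrated estimate.
        combined = tok.ocr_conf * p
        if combined < threshold:
            flagged.append((tok.text, label, combined))
    return flagged


# Hypothetical tokens: one clean, one with a shaky OCR read,
# one with a clean read but an uncertain label.
tokens = [
    Token("Invoice", 0.99, {"HEADER": 0.97, "OTHER": 0.03}),
    Token("1O24.50", 0.61, {"TOTAL": 0.90, "OTHER": 0.10}),
    Token("Acme", 0.98, {"VENDOR": 0.55, "OTHER": 0.45}),
]

for text, label, combined in review_queue(tokens):
    print(f"{text!r} as {label}: combined confidence {combined:.2f}")
```

Either a doubtful OCR read or a doubtful label is enough to send a token to review, and everything that gets flagged is a specific token on the page rather than the whole document.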

1 comment

tensor 479 days ago

100% this. Combining traditional OCR with VLMs that can work with bounding boxes, so that you can correlate the two, is the way to go.
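A rough sketch of what that correlation could look like, assuming the VLM returns a value plus an `(x0, y0, x1, y1)` box and the OCR engine returns words with boxes in the same pixel coordinates and in reading order. The function names, the 0.5 overlap cutoff, and the sample data are hypothetical.

```python
def overlap_fraction(word_box, region_box):
    """Fraction of word_box's area that lies inside region_box."""
    wx0, wy0, wx1, wy1 = word_box
    rx0, ry0, rx1, ry1 = region_box
    iw = max(0.0, min(wx1, rx1) - max(wx0, rx0))
    ih = max(0.0, min(wy1, ry1) - max(wy0, ry0))
    area = (wx1 - wx0) * (wy1 - wy0)
    return (iw * ih) / area if area > 0 else 0.0


def check_field(vlm_text, vlm_box, ocr_words, min_overlap=0.5):
    """Compare a VLM-extracted value with the OCR words found under its box.

    Returns (ocr_text, agrees). ocr_words is a list of (text, box) pairs,
    assumed to be in reading order.
    """
    inside = [
        text for text, box in ocr_words
        if overlap_fraction(box, vlm_box) >= min_overlap
    ]
    ocr_text = " ".join(inside)
    agrees = ocr_text.strip().lower() == vlm_text.strip().lower()
    return ocr_text, agrees


# Hypothetical OCR output for a receipt.
ocr_words = [
    ("Total:", (10, 100, 60, 115)),
    ("42.00", (70, 100, 110, 115)),
    ("Thanks", (10, 130, 60, 145)),
]

print(check_field("42.00", (65, 98, 115, 118), ocr_words))
print(check_field("47.00", (65, 98, 115, 118), ocr_words))
```

When the two disagree, you at least know which region of the page to look at, which is the localized failure mode you want.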
