| HN Mirror



	by rafram 480 days ago
	It’s an interesting idea, but still way too unreliable to use in production IMO. When a traditional OCR model can’t read the text, it’ll output gibberish with low confidence; when a VLM can’t read the text, it’ll output something confidently made up, and it has no way to report confidence. (You can ask it to, but the number will itself be made up.) I tried using a VLM to recognize handwritten text in genealogical sources, and it made up names and dates that sort of fit the vibe of the document when it couldn’t read the text! They sounded right for the ethnicity and time period but were entirely fake. There’s no way to ground the model using the source text when the model is your OCR.

9 comments

themanmaran 480 days ago

Thing is, the majority of OCR errors aren't character issues, but layout issues. Things like complex tables with cells being returned under the wrong header. And if the numbers in an income statement are one column off, that creates a pretty big risk.

Confidence intervals are a red herring. And only as good as the code interpreting them. If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?

If so you'd be passing every single document to a human review, and might as well not run the OCR. But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.

tensor 480 days ago

Having experience in this area, audit, legal, confidence intervals are essential. No, you don't end up "passing every single document" to human review. That's made up nonsense. But confidence intervals can pretty easily flag poorly OCR'd documents, and then yes they are done by human review.

If you try to pitch hallucinations to these fields, they'll just choose 100% manual instead. It's a non-starter.

xattt 480 days ago

I work in a health insurance adjacent field. I can see my work going the way of the dodo as soon as VLMs take off in interpreting historical health records with physicians’ handwriting.

gtirloni 480 days ago

So never considering their handwriting :)

That being said, all doctors I have consulted with in the past year or so used signed electronic prescriptions.

anon373839 480 days ago

> But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM.

That's not true. LLMs and OCR have very different failure modes. With LLMs, there is unbounded potential for hallucination, and the entire document is at risk. For example: if something in the lower right-hand corner of the page takes the model to a sparsely sampled part of the latent space, it can end up deciding that it makes sense to rewrite the document title! Or anything else. LLMs also have a pernicious habit of "helpfully" completing partial sentences that appear at the beginning or end of a page of text.

With OCR, errors are localized and have a greater chance of being detected when read.

I think for a lot of cases, the best solution is to fine-tune a model like LayoutLM, which can classify the actual text tokens in a document (whether obtained from OCR or a native text layer) using visual and spatial information. Then, there are no hallucinations and you can use uncertainty information from both the OCR (if used) and the text classification. But it does mean that you have to do the work of annotating data and training a model, rather than prompt engineering...

tensor 479 days ago

100% this, combining traditional OCR with VLMs that can work with bounding boxes so that you can correlate the two is the way to go.
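A minimal sketch of that correlation step, assuming both the OCR engine and the VLM return words with `(x0, y0, x1, y1)` bounding boxes. The `Word` type, `cross_check` function, and the sample values are hypothetical, not any particular library's API. It flags VLM words that no overlapping OCR word agrees with:

```python
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    box: tuple[float, float, float, float]  # (x0, y0, x1, y1)

def iou(a, b):
    """Intersection-over-union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cross_check(ocr_words, vlm_words, min_iou=0.5):
    """Return VLM words that the best-overlapping OCR word does not confirm."""
    suspicious = []
    for v in vlm_words:
        best = max(ocr_words, key=lambda o: iou(o.box, v.box), default=None)
        if best is None or iou(best.box, v.box) < min_iou or best.text != v.text:
            suspicious.append(v)
    return suspicious

# Hypothetical outputs for the same table row.
ocr = [Word("Total", (0, 0, 50, 10)), Word("1,204", (60, 0, 100, 10))]
vlm = [Word("Total", (1, 0, 50, 10)), Word("1,284", (60, 0, 100, 10))]
flagged = cross_check(ocr, vlm)  # only the disagreeing number is flagged
```

A disagreement does not say which side is wrong, only where a human should look.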

bayindirh 480 days ago

The problem is, regardless of the confidence number, you can scan and mark a document for grammatical errors.

In VLM/LLM-powered methods, the missing/misread data will be hallucinated and you can't know whether something scanned correctly or not. I personally scan and OCR tons of personal documents, and I prefer "gibberish" to "hallucinations" because it's easier to catch.

We had this problem before [0], on some Xerox scanners and copiers. Results will be disastrous. It's not a question of if, but when.

I personally tried Gemini and OpenAI's models for OCR, and no, I won't continue using them further.

[0]: https://www.theregister.com/2013/08/06/xerox_copier_flaw_mea...

rafram 480 days ago

Then use an LLM to extract layout information. Don’t trust it to read the text.

> If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90?

No, of course not. You have a human review the words/segments with low confidence.
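A rough sketch of that word-level review, assuming the OCR engine returns `(text, confidence)` pairs with confidence in the 0 to 1 range. The function name, the threshold, and the sample words are illustrative only:

```python
def words_needing_review(words, threshold=0.90):
    """Return (index, text, confidence) for each word below the threshold."""
    return [(i, text, conf) for i, (text, conf) in enumerate(words) if conf < threshold]

# Hypothetical OCR output for one line of an income statement.
ocr_output = [("Net", 0.97), ("income", 0.95), ("4,8l2", 0.71), ("USD", 0.93)]
review_queue = words_needing_review(ocr_output)
# Only the third word goes to a human; the rest of the document passes through.
```

The threshold would need tuning per OCR engine and per language, since confidence scales are not guaranteed to be comparable.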

sudoshred 479 days ago

That’s assuming that confidence intervals are even independently comparable. Anecdotally major OCR services with specific languages have average confidence intervals that are wildly divergent from similar services with different languages for the same relative quality of result. Acting as if confidence interval is in any way absolute or otherwise able to reliably and consistently indicate the relative quality of results is a mischaracterization at best. In the worst case CI is as good as an RNG. The value of the CI is in the ability to tune usage of the results based on observations of the users and characteristics of the request, sometimes it is meaningful but not always. In this case “good” code essentially hardcodes handling for all the idiosyncrasies of the common usage and the OCR service.

constantinum 480 days ago

The primary issue with LLMs is hallucination, which can lead to incorrect data and flawed business decisions.

For example, Llamaparse(https://docs.llamaindex.ai/en/stable/llama_cloud/llama_parse...) uses LLMs for PDF text extraction but faces hallucination problems. See this issue for more details: https://github.com/run-llama/llama_parse/issues/420.

For those interested, try LLMWhisperer(https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs, eliminates hallucination issues, and preserves the input document layout for better context.

Examples of extracting complex layout:

https://imgur.com/a/YQMkLpA

https://imgur.com/a/NlZOrtX

https://imgur.com/a/htIm6cf

Hackbraten 480 days ago

> try LLMWhisperer(https://unstract.com/llmwhisperer/) for OCR. It avoids LLMs

The website you linked says it uses LLMs?

constantinum 480 days ago

The tool doesn't use any LLMs for processing/parsing the data. It parses and converts into raw text.

The final output(raw text) of the parsing is then fed to LLMs for data extraction. e.g. Extracting data from insurance, banking, and invoice documents.

ungerik 480 days ago

Those images look exactly like what you get from every OCR tool out there if you use the XY information.

EarlyOom 480 days ago

This is the main focus of VLM Run and typed extraction more generally. If you provide proper type constraints (e.g. with Pydantic) you can dramatically reduce the surface area for hallucination. Then there's actually fine-tuning on your dataset (we're working on this) to push accuracy beyond what you get from an unspecialized frontier model.

rafram 480 days ago

Re type constraints: Not really. If one of the fields in my JSON is `name` but the model can’t read the name on the page, it will very happily make one up. Type constraints are good for making sure that your data is parseable, but they don’t do anything to fix the undetectable inaccuracy problem.

Fine-tuning does help, though.
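To illustrate the type-constraint point, here is a small sketch that uses standard-library dataclasses as a stand-in for a Pydantic schema. The `Record` type, `validate` function, and the names in the example are all made up for illustration. A fabricated value passes validation exactly as well as an honest null:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Record:
    name: Optional[str]
    birth_year: Optional[int]

def validate(raw: dict) -> Record:
    """Check field types only; raises TypeError on a type mismatch."""
    name, year = raw.get("name"), raw.get("birth_year")
    if name is not None and not isinstance(name, str):
        raise TypeError("name must be a string or null")
    if year is not None and not isinstance(year, int):
        raise TypeError("birth_year must be an integer or null")
    return Record(name, year)

# Suppose the name on the page is illegible and only the year can be read.
honest = validate({"name": None, "birth_year": 1871})
made_up = validate({"name": "Jan Kowalski", "birth_year": 1871})  # also passes
```

The schema guarantees the output parses, but it has no access to the page, so it cannot tell the two cases apart.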

fzysingularity 480 days ago

Yes, both false positives and false negatives like the one you mentioned sometimes happen when the schema is ill-defined. Making name optional via `name: str | None` actually turns out to ensure that the model only fills it if it’s certain that field exists.

These are some of the nuances we had to work with during VLM fine-tuning with structured JSON.

rafram 480 days ago

You seem to be missing my point.

hashta 480 days ago

An effective way that usually increases accuracy is to use an ensemble of capable models that are trained independently (e.g., gemini, gpt-4o, qwen). If >x% of them have the same output, accept it, otherwise reject and manually review

rafram 480 days ago

There’s a very low chance that three separate models will come up with the same result. There are always going to be errors, small or large. Even if you find a way around that, running the process three times on every page is going to be prohibitively expensive, especially if you want to finetune.

vintermann 480 days ago

No, running it two or three times for every page isn't prohibitive. In fact, one of the arguments for using modern general-purpose multimodal models for historical HTR is that it is cheaper and faster than Transkribus.

What you can do is for instance to ask one model for a transcription, and ask a second model to compare the transcription to the image and correct any errors it finds. You actually have a lot of budget to try things like these if the alternative is to fine-tune your own model.

jjk166 479 days ago

The odds of them getting the same result for any given patch should be very high if it is the correct result and they aren't garbage. The only times where they are not getting the same result would be the times when at least one has made a mistake. The odds of 3 different models making the same mistake should be low (unless it's something genuinely ambiguous like 0 vs O in a random alphanumeric string).

Best 2 out of 3 should be far more reliable than any model on its own. You could even weight their responses for different types of results, like say model B is consistently better for serif fonts, maybe their confidence counts for 1.5 times as much as the confidence of models A and C.
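A minimal sketch of that weighted vote, assuming each model's transcription of a patch is compared as an exact string. The model names, weights, and the `min_share` cutoff are hypothetical:

```python
from collections import defaultdict

def weighted_vote(outputs, weights=None, min_share=0.6):
    """Return the winning transcription, or None if agreement is too weak.

    outputs: dict mapping model name -> transcription of one patch
    weights: dict mapping model name -> vote weight (default 1.0 each)
    """
    if not outputs:
        return None
    weights = weights or {}
    totals = defaultdict(float)
    for model, text in outputs.items():
        totals[text] += weights.get(model, 1.0)
    winner, score = max(totals.items(), key=lambda kv: kv[1])
    if score / sum(totals.values()) >= min_share:
        return winner
    return None  # no sufficient agreement: send to manual review

agreed = weighted_vote({"A": "1871", "B": "1871", "C": "1877"})
split = weighted_vote({"A": "1871", "B": "1877", "C": "1817"}, weights={"B": 1.5})
```

In the first call two of three models agree, so their answer is accepted; in the second all three differ, and even the extra weight on model B is not enough to clear the cutoff.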

refulgentis 480 days ago

That's not OCR.

It is an absolute miracle.

It is transmuting a picture into JSON.

I never thought this would be possible in my lifetime.

But that is different from what your interlocutor is discussing.

1024core 480 days ago

> I never thought this would be possible in my lifetime.

I used to work in Computer Vision and Image Processing. These days I utter this sentence on an almost daily basis. :-D

KoolKat23 480 days ago

I've been using Gemini 2 Flash to extract financial data. Within my sample, which is perhaps small (probably 1000 entries so far), I've had only a single error, so roughly a 99.9% success rate.

(There are slightly more errors if I ask it to add numbers, but this isn't OCR and is a bit more of a reach, although it is very good at this too regardless).

Many hallucinations can be avoided by telling it to use null if there is no number present.
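For a sense of how much a sample of about 1000 entries with one error pins down the true error rate, here is a sketch using the Wilson score interval. This is a standard binomial interval chosen for illustration, and it assumes the entries are independent and representative:

```python
import math

def wilson_interval(errors, n, z=1.96):
    """Wilson score interval for an error rate (z=1.96 is roughly 95%)."""
    p = errors / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

low, high = wilson_interval(errors=1, n=1000)
print(f"error rate between {low:.4%} and {high:.4%}")
```

With these inputs the interval runs from roughly 0.02% to roughly 0.56%, so the observed 99.9% is compatible with a true success rate anywhere from about 99.4% upward.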

CarolineRommer 479 days ago

And by using two different systems (say Gem plus ChatGPT) you essentially reduce chances of hallucination to zero, no? You would need to be VERY unlucky to find two LLMs hallucinating the exact same response.

cratermoon 480 days ago

Agree wholeheartedly. Modern OCR is astonishingly good, more importantly it's deterministically so. Its failure modes, when it's unable to read the text, are recognizably failures.

Results for VLM accuracy & precision are not good. https://arxiv.org/html/2406.04470v1#S4

VeejayRampay 479 days ago

which solutions would you classify as "modern OCR"

are we talking tesseract or something?

criddell 479 days ago

Probably something like Apple Vision Framework or Amazon Textract or Google's Cloud Vision.

Tesseract does well under ideal conditions, but the world is messy.

cratermoon 479 days ago

I was thinking ABBYY FineReader, but those, too. Instead of using VLMs or any sort of generative AI, they're built on good old-fashioned feature extraction and nearest neighbor classifiers such as the k-nearest neighbors algorithm. It's possible to build a working prototype of this technique using basic ML algorithms.
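A toy prototype of the nearest-neighbor idea, assuming glyphs have already been segmented and binarized into tiny 3x3 bitmaps. The training set is invented, and this is not a claim about how any commercial engine is actually implemented:

```python
from collections import Counter

# Toy 3x3 binary glyphs, flattened row by row (hypothetical training data).
TRAIN = [
    ((1, 1, 1, 1, 0, 1, 1, 1, 1), "O"),
    ((1, 1, 1, 1, 0, 1, 1, 1, 0), "O"),
    ((0, 1, 0, 0, 1, 0, 0, 1, 0), "I"),
    ((0, 1, 0, 0, 1, 0, 0, 1, 1), "I"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 1), "L"),
    ((1, 0, 0, 1, 0, 0, 1, 1, 0), "L"),
]

def hamming(a, b):
    """Number of pixels that differ between two flattened glyphs."""
    return sum(x != y for x, y in zip(a, b))

def classify(glyph, k=3):
    """Return (label, vote share) from the k nearest training glyphs."""
    nearest = sorted(TRAIN, key=lambda item: hamming(item[0], glyph))[:k]
    votes = Counter(label for _, label in nearest)
    label, count = votes.most_common(1)[0]
    return label, count / k

label, confidence = classify((0, 1, 0, 0, 1, 0, 0, 1, 0))
```

The vote share doubles as a crude confidence score: a glyph that sits between classes splits its neighbors and scores low instead of producing a fluent guess.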

delichon 480 days ago

How about calculating confidence in terms of which output regions are stable across the same input on multiple tries. Expensive, but the hallucinations should have more variable output and be fuzzier than higher confidence regions in averages.
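One way this could be sketched, assuming you already have several transcriptions of the same image as plain strings: align each run against the first one with `difflib` and count how often each token is reproduced. The function name and the sample runs are illustrative:

```python
from difflib import SequenceMatcher

def token_stability(runs):
    """For each token of the first run, the fraction of runs that reproduce it."""
    reference = runs[0].split()
    agree = [1] * len(reference)  # the reference agrees with itself
    for other in runs[1:]:
        matcher = SequenceMatcher(a=reference, b=other.split(), autojunk=False)
        for block in matcher.get_matching_blocks():
            for i in range(block.a, block.a + block.size):
                agree[i] += 1
    return [(tok, n / len(runs)) for tok, n in zip(reference, agree)]

# Hypothetical outputs from three runs on the same handwritten line.
runs = [
    "Born 12 March 1871 in Lodz",
    "Born 12 March 1877 in Lodz",
    "Born 12 March 1871 in Lodz",
]
scores = token_stability(runs)
unstable = [tok for tok, s in scores if s < 1.0]  # tokens worth a second look
```

Stability is only a proxy: a model that makes the same mistake on every run would still score as fully stable.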

staticman2 480 days ago

I think it would be pretty reliable in controlled circumstances. If I take a picture of a book with my cell phone, Google Gemini Pro is much better at recognizing the text than Samsung's built-in OCR.

Grimblewald 480 days ago

I would think the same. The cause for hesitation is that we only think this, but cannot know it without thorough testing. Right now the scope of problems where things behave reliably and as expected, and the scope of problems where things get wacky, are unknown. The borders are known to some rather fuzzy extent at best, by people who work with these things as a full-time job. This means we are just blindly gambling on it. For important things, archiving, etc. where truth matters, I will continue using traditional OCR until we can better define the reliable use-case scope of LLM-based OCR. I am extremely enthusiastic about LLMs and the things they offer, but I am also a realist. LLMs are an infant technology, and nowhere near the level of maturity that companies like OpenAI claim.

the8472 479 days ago

Shouldn't confidence be available at the sampler level and also be conditional on the vision input, not just the next-token prediction?

j_bum 480 days ago

This is naive, but can you ask the model to provide a confidence rating for sections of the document?

thatjoeoverthr 480 days ago

More broadly, it’s not trained to have any self-awareness, and this is a factor in other “hallucinations”. If you ask it, for example, to describe the “marathon crater”, it doesn’t recognize that there’s no such thing in its corpus, but will instead start by writing an answer (“sure! The marathon crater is..”) and freestyle from there. Same if you ask it why it did something, or details about itself, etc. You should access one directly (not through an app like ChatGPT) and build a careful suite of tests to learn more. Really fascinating.

_delirium 479 days ago

Yes, there’s research showing that models’ self-assessment of probabilities (when you ask them via prompting) don’t even match the same models’ actual probabilities, in cases where you can measure the probabilities directly (e.g. by looking at the logits): https://arxiv.org/abs/2305.13264

anon291 479 days ago

Logits are not probabilities... at least not in the way you understand probability. Probabilities mathematically are anything that broadly behaves like a probability, whereas colloquially probabilities represent the likelihood or the preponderance of a particular phenomenon. Logits are not either of those.

_delirium 479 days ago

The probability of token generation is a function of the logits. Do you have an actual point related to the linked paper?

anon291 478 days ago

That is one way of sampling tokens. It is not the only way. Logits do not map neatly to belief, although it is convenient to behave as if they do
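For readers following this exchange, a small sketch of the usual softmax mapping from logits to token probabilities, with a temperature parameter to show that the same logits yield different distributions depending on the sampling setup. The logit values are invented:

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw logits to a probability distribution."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]      # hypothetical scores for three candidate tokens
plain = softmax(logits)       # temperature 1.0
sharp = softmax(logits, 0.5)  # lower temperature concentrates mass on the top token
flat = softmax(logits, 2.0)   # higher temperature spreads it out
```

Each result sums to 1 and behaves like a probability, but which of them, if any, reflects the model's "belief" is exactly what is being disputed here.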

UnlockedSecrets 480 days ago

You can ask, and it will be made up, not grounded in reality.

j_bum 480 days ago

Sure, but I’m curious if it would serve to provide some self-regulation.

E.g., all of this “thinking” trend that’s happening. It would be interesting if the model does a first pass, scores its individual outputs, then reviews its scores and censors/flags scores that are low.

I know it’s all “made up”, but generally I have a lot of success asking the model to give 0-1 ratings on confidence for its answers, especially for new niche questions that are likely out of the training set.

rafram 480 days ago

It doesn’t. Asking for confidence doesn’t prompt it to make multiple passes, and there’s no real concept of “passes” when you’re talking about non-reasoning models. The model takes in text and image tokens and spits out the text tokens that logically follow them. You can try asking it to think step by step, or you can use a reasoning model that essentially bakes that behavior into the training data, but I haven’t found that to be very useful for OCR tasks. If the encoded version of your image doesn’t resolve to text in the model’s latent space, it never will, no matter how much the model “reasons” (spits out intermediate text tokens) before giving a final answer.

ttyprintk 480 days ago

It’s not naive; tesseract does this.

rafram 480 days ago

Tesseract doesn’t use an LLM. LLMs don’t know how confident they are; Tesseract’s model does.

touisteur 480 days ago

With most machine learning algorithms I used to get Shapley values or other 'explainable AI' metrics (for a large cost compared to simple inference, yes). It's very unsettling and frustrating to work without them now on LLMs.

hansvm 480 days ago

Kind of. Tesseract's confidence is just a raw model probability output. You could easily use the entropy associated with each token coming out of an LLM to do the same thing.
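A minimal sketch of the entropy idea, assuming you can obtain a probability distribution over candidates for each generated token (for example from exposed top-k log-probabilities). The token data and the 1-bit cutoff are made up:

```python
import math

def entropy_bits(probs):
    """Shannon entropy in bits of one token's probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def flag_uncertain(tokens, max_bits=1.0):
    """tokens: list of (text, probs) pairs; return texts whose entropy exceeds max_bits."""
    return [text for text, probs in tokens if entropy_bits(probs) > max_bits]

# Hypothetical per-token distributions over the top three candidates.
tokens = [
    ("Born", [0.98, 0.01, 0.01]),
    ("1871", [0.40, 0.35, 0.25]),
]
uncertain = flag_uncertain(tokens)  # the year is flagged, the word is not
```

A peaked distribution gives low entropy and a spread-out one gives high entropy; whether high entropy actually corresponds to illegible text is a separate question.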

rafram 479 days ago

True, but LLM token probability doesn't map nearly as cleanly to "how readable was the text".

hansvm 479 days ago

Why not though? Both kinds of models jumble around the data and spit out a probability distribution. Why is the tesseract distribution inherently more explainable (aside from the UI/UX problem of the uncertainty being per-token instead of per-character)?