|
|
|
|
|
by themanmaran
483 days ago
|
|
Thing is, the majority of OCR errors aren't character issues, but layout issues. Things like complex tables with cells being returned under the wrong header. And if the numbers in an income statement are one column off creates a pretty big risk. Confidence intervals are a red herring. And only as good as the code interpreting them. If the OCR model gives you back 500 words all ranging from 0.70 to 0.95 confidence, what do you do? Reject the entire document if there's a single value below 0.90? If so you'd be passing every single document to a human review, and might as well not run the OCR. But if you're not rejecting based on CI, then you're exposed to just as much risk as using an LLM. |
|
If you try to pitch hallucinations to these fields, they'll just choose 100% manual instead. It's a non-starter.