by yorwba 488 days ago
What would you say is currently the most accurate OCR solution if you're not concerned about speed and memory usage?

So, I did some OCR research early last year (no VLMs included) on scanned 1960s-era English documents with a mix of typed and handwritten text (about 80/20). Here's what I found, in terms of cosine similarity:

                    Overall | Handwritten | Typed
  Google Vision:    98.80%  | 93.29%      | 99.37%
  Amazon Textract:  98.80%  | 95.37%      | 99.15%
  surya:            97.41%  | 87.16%      | 98.48%
  azure:            96.09%  | 92.83%      | 96.46%
  trocr:            95.92%  | 79.04%      | 97.65%
  paddleocr:        92.96%  | 52.16%      | 97.23%
  tesseract:        92.38%  | 42.56%      | 97.59%
  nougat:           92.37%  | 89.25%      | 92.77%
  easy_ocr:         89.91%  | 35.13%      | 95.62%
  keras_ocr:        89.7%   | 41.34%      | 94.71%
Overall is a weighted average of Handwritten and Typed. I also computed Jaccard and Levenshtein distance, but the results were similar enough that I'm leaving them out for the sake of space.
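For anyone wanting to reproduce this kind of scoring, here is a minimal sketch of the three metrics named above. The comment doesn't say how the text was vectorized, so whitespace-split word counts for cosine and word sets for Jaccard are assumptions here, and the function names are just illustrative.

```python
import math
from collections import Counter


def cosine_similarity(truth: str, ocr: str) -> float:
    """Cosine similarity between word-count vectors (assumed tokenization)."""
    va, vb = Counter(truth.split()), Counter(ocr.split())
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def jaccard_similarity(truth: str, ocr: str) -> float:
    """Shared distinct words divided by all distinct words."""
    sa, sb = set(truth.split()), set(ocr.split())
    union = sa | sb
    return len(sa & sb) / len(union) if union else 1.0


def levenshtein(truth: str, ocr: str) -> int:
    """Character-level edit distance (insert, delete, substitute all cost 1)."""
    prev = list(range(len(ocr) + 1))
    for i, a in enumerate(truth, start=1):
        cur = [i]
        for j, b in enumerate(ocr, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (a != b)))
        prev = cur
    return prev[-1]
```

Each function takes the ground-truth transcription and the OCR output for one page. The two similarities are in [0, 1], while Levenshtein is a raw edit count that you would probably want to normalize by the length of the ground truth before comparing pages.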

Overall, if you want the best and you're an enterprise, just use whatever AWS/GCP/Azure you're on; if you're an individual, pick between those. While some of the open source solutions do quite well, surya took 188 seconds to process 88 pages on my RTX 3080, while the cloud ones took a few seconds to upload the docs and download the results.

But if you do want open source, seriously consider surya, tesseract, and nougat depending on your needs. Surya is the best overall, while nougat was pretty good at handwriting. Tesseract is just blazingly fast: it took 121-200 seconds depending on whether I used tessdata_fast or tessdata_best, but that's CPU-based and trivially parallelizable, and on my 5950X using all the cores it took only 10 seconds to run through all 88 pages.
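To illustrate the "trivially parallelizable" point, here is a rough sketch of fanning pages out over all cores. It is not the setup used for the numbers above. It assumes the `tesseract` binary is on your PATH and that each page is already a separate image file; `tesseract_cli` and `ocr_pages` are hypothetical helper names.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, Optional


def tesseract_cli(image_path: str) -> str:
    """OCR one page image with the tesseract CLI and return its text."""
    # Assumption: limit each tesseract process to one OpenMP thread so the
    # pool below, not OpenMP, decides how the cores are shared.
    env = dict(os.environ, OMP_THREAD_LIMIT="1")
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", "eng"],
        capture_output=True, text=True, check=True, env=env,
    )
    return result.stdout


def ocr_pages(
    paths: Iterable[str],
    ocr_fn: Callable[[str], str] = tesseract_cli,
    workers: Optional[int] = None,
) -> Dict[str, str]:
    """Run ocr_fn over every page concurrently; returns {path: text}."""
    page_list = list(paths)
    workers = workers or os.cpu_count() or 1
    # Threads are enough here: the heavy work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        texts = list(pool.map(ocr_fn, page_list))
    return dict(zip(page_list, texts))
```

Passing a different `ocr_fn` lets you swap in another engine or a stub for testing, e.g. `ocr_pages(["p1.png", "p2.png"], ocr_fn=my_engine)`.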

But really, you need to generate some sample test data of your own and run it through the models to see what's best. Given, frankly, how little this paper tested, I really should redo my study, add VLMs, and write a small blog post or paper; I've been meaning to for years now.
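A bare-bones harness for that kind of "bring your own samples" comparison might look like the following. Everything here is a sketch under assumptions: `Sample`, `similarity`, and `benchmark` are made-up names, each engine is assumed to be a callable taking an image path and returning text, and `difflib`'s ratio is only a stand-in for whichever metric you actually care about.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Sample:
    path: str   # scanned page image
    truth: str  # hand-checked transcription
    kind: str   # e.g. "handwritten" or "typed"


def similarity(truth: str, ocr: str) -> float:
    """Stand-in score in [0, 1]; swap in your preferred metric."""
    return SequenceMatcher(None, truth, ocr).ratio()


def benchmark(
    engines: Dict[str, Callable[[str], str]], samples: List[Sample]
) -> Dict[str, Dict[str, float]]:
    """Mean score per engine and per kind, plus a count-weighted overall."""
    report: Dict[str, Dict[str, float]] = {}
    for name, run in engines.items():
        by_kind: Dict[str, List[float]] = {}
        for s in samples:
            by_kind.setdefault(s.kind, []).append(similarity(s.truth, run(s.path)))
        row = {kind: mean(scores) for kind, scores in by_kind.items()}
        # Overall = per-kind means weighted by how many samples each kind has.
        row["overall"] = sum(
            mean(v) * len(v) for v in by_kind.values()
        ) / len(samples)
        report[name] = row
    return report
```

Note that `benchmark` expects at least one sample, and the overall score is only as representative as your typed/handwritten mix, so it is worth reading the per-kind columns rather than just the overall number.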

I've been looking for handwritten benchmarks for a while and would love to read that blog post.

For handwritten texts, the tool that works best for me is Qwen2.5-VL-72b [0]. It is also available online [1]. I'm surprised that it is not mentioned in the article, since even the previous model (Qwen2-VL-72b) was better than the other VLMs I tried for OCR on handwritten texts.

[0]: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct

[1]: https://chat.qwenlm.ai

Not GP, but it depends on what you mean by accuracy. If you want inference like the 'coconut milk' described, then obviously an LLM. If you want accurate as-written transcription, then I don't know the state of the art, but it'll be something purpose-built for CV & handwriting recognition.

It'll also depend on whether you care about tabular data, and whether a 'minor' numerical error (like 0 and 8 occasionally being confused) is significantly worse than a 'typo', as it were, in recognising a word, etc.
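One way to make that trade-off explicit is an edit distance that charges more for mistakes involving digits. This is a sketch rather than an established metric: the function names are hypothetical and the default `digit_cost` of 5.0 is an arbitrary assumption you would tune to how much a wrong number hurts in your application.

```python
def char_cost(c: str, digit_cost: float) -> float:
    """Cost of inserting or deleting one character."""
    return digit_cost if c.isdigit() else 1.0


def weighted_edit_distance(truth: str, ocr: str, digit_cost: float = 5.0) -> float:
    """Edit distance where any edit touching a digit costs digit_cost."""
    # First row: cost of inserting each prefix of the OCR text from nothing.
    prev = [0.0]
    for b in ocr:
        prev.append(prev[-1] + char_cost(b, digit_cost))
    for a in truth:
        cur = [prev[0] + char_cost(a, digit_cost)]
        for j, b in enumerate(ocr, start=1):
            # A substitution is as expensive as its most expensive character.
            sub = 0.0 if a == b else max(
                char_cost(a, digit_cost), char_cost(b, digit_cost)
            )
            cur.append(min(
                prev[j] + char_cost(a, digit_cost),     # truth char missing
                cur[j - 1] + char_cost(b, digit_cost),  # spurious OCR char
                prev[j - 1] + sub,                      # match or substitution
            ))
        prev = cur
    return prev[-1]
```

With the default cost, reading "1080" as "1000" scores 5.0 while reading "milk" as "mile" scores 1.0, so a table full of slightly wrong numbers ranks worse than prose with the same number of ordinary typos.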

Accuracy should always be measured against the answer you want, which is the most useful answer for applications. That is "coconut milk", not "coconut cnbc". Maybe "cnbc" should even be included, but it definitely should not replace the word "milk" in that location.

Lots of factors to rank on, but generally speaking I don't find any of the open source options usable. They all either take a long time to tune or are just not accurate enough. Commercial services from one of the cloud players have hit the sweet spot for me.