Hacker News
by spwa4 488 days ago
This looks a lot like "compared to a bunch of people who are 10 years behind (non-transformer, vision-only models), and people who aren't trying (aren't optimizing for OCR) Google is doing real well"

EasyOCR is LSTM-CTC from 2007, RapidOCR is a ConvNet approach from 2021, both focused on speed. Both will vastly outperform almost any transformer model, and certainly a big one, on speed and memory usage, but they aren't state of the art on accuracy. This is well known, for a decade at this point. 2 decades for LSTM-CTC.

Plus, I must say the GPT-4o results look a lot saner. "COCONUT" (GPT-4o) vs "CONU CNBC" (Gemini) vs Ground Truth "C CONU CNBC". And, obviously the ground truth should be "COCONUT MILK" (the word milk is almost entirely out of the picture, but is still the right answer that a human would give). The "C CONU" comes from the first O of COCONUT being somewhat obscured by a drawing of ... I don't know what the hell that is. It's still very obvious it's meant to be "COCONUT MILK", so the GPT-4o answer is still not quite perfect, but heaps better than all the others.

Now this looks very much like it might be temperature related, and I can find nothing in the paper about changing the temperature, which is imho a very big gap (temperature gives transformer models more freedom to choose more creative answers. The better performance of GPT-4o might well be the result of such a more creative choice, and might also explain why Gemini is trying so hard to stay so very close to the ground truth. It's still quite the accomplishment to succeed, but GPT-4o is still better)
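
To make the temperature point concrete, here is a minimal sketch of temperature-scaled softmax in Python. The candidate strings and logits are made up for illustration; they are not taken from the paper or from either model, and real APIs expose temperature as a sampling parameter rather than as a function like this one.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Turn raw scores into probabilities, dividing by the temperature first."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

if __name__ == "__main__":
    # Hypothetical candidates and scores, purely illustrative.
    candidates = ["C CONU", "COCONUT", "CONU"]
    logits = [2.0, 1.0, 0.5]
    for t in (0.2, 1.0, 2.0):
        probs = softmax_with_temperature(logits, t)
        print(t, dict(zip(candidates, (round(p, 3) for p in probs))))
```

A low temperature concentrates almost all the probability on the top-scoring candidate, while a higher one spreads it out, so lower-ranked readings get sampled more often. Whether that explains the difference between the two models here is speculation.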

7 comments

> And, obviously the ground truth should be "COCONUT MILK" (the word milk is almost entirely out of the picture, but is still the right answer that a human would give).

Maybe? Seems application-dependent to me.

If you're OCRing checks or invoices or car license plates or tables in PDF documents, you might prefer a model that's more conservative when it comes to filling in the blanks!

And even when recognising packaged coconut products, you've also got your organic coconut oil, organic coconut milk with reduced fat, organic coconut cream, organic coconut flakes, organic coconut desiccated chips, organic coconut and strawberry bites, organic coconut milk powder, organic coconut milk block, organic coconut milk 9% fat, organic coconut yoghurt, organic coconut milk long life barista-style drink, organic coconut kefir, organic coconut banana and pear baby food pouches, organic coconut banana and pineapple smoothie, organic coconut scented body wash and so on.

>The "C CONU" comes from the first O of COCONUT being somewhat obscured by a drawing of ... I don't know what the hell that is.

It's clearly the stem from the bell pepper in front of the can. You're complaining that the software is lesser than a human, yet it appears your human needs better training in understanding context too.

Why would a can of coconut milk have a drawing of a bell pepper obscuring the writing? How does that make ANY sense at all?
Yup, definitely the human needs better context training. Then again, for an account that's only 6 months old, it's possible you're not really a human.

Edit to insert: WHAT DRAWING? There's a can of coconut milk that is turned so the word coconut is not fully visible. In front of that can is a real red bell pepper with a green stem still attached that is partially obstructed by the bowls in the foreground. What you're attempting to claim as a drawing is just a real life object in the table top setup. Since this is a CNBC branding image, I'm assuming this is a still frame from a video clip. Based on being a video type person, this view probably changes based on time with different things being obstructed/revealed by the camera's movement.

Your RLHF could really use some improvement. To be this argumentative when you're clearly wrong is quite amusing, but not in an entertaining way. It just reinforces my sentiments towards the joke the industry has become

The question is: what is OCR for? If it's to answer questions and work with a document, then VLMs do actually contain self-correcting mechanisms. That is, the end-to-end mapping from image + text input to text output is statistically grounded by training. So the question to ask is what you need OCR for. Feeding an LLM? Then feed the image to the VLM instead. Some other usage? Well, to be decided. But as of now, CTC and LSTMs are done with, because VLMs do everything: finding the area to read, reading, embedding, and answering. OCR was an intermediate step, and it's going away.
It's not obvious at all—it depends on the use case.

You also didn’t really counter the paper. Sure, the OCR models are old, but what should they have tested instead? Are there better open-source OCR models available that would have made for a fairer comparison?

This is what's so terrifying about uses of "AI". People's idea of accuracy being "tell me what I think is there", not "tell me what's there". The can in this image probably says "coconut milk", but the image certainly doesn't.
I think it's useful to add the context that CNBC is correct and does appear at the top right of that picture. CNBC is not a mis-transcribing of MILK, and the letters M, I, L and K are not actually visible in the picture.
What would you say is currently the most accurate OCR solution if you're not concerned about speed and memory usage?
So, I did some OCR research early last year, that didn't include any VLMs, on some 1960s era English scanned documents with a mix of typed and handwritten (about 80/20), and here's what I found (in terms of cosine similarity):

                    Overall | Handwritten | Typed
  Google Vision:    98.80%  | 93.29%      | 99.37%
  Amazon Textract:  98.80%  | 95.37%      | 99.15%
  surya:            97.41%  | 87.16%      | 98.48%
  azure:            96.09%  | 92.83%      | 96.46%
  trocr:            95.92%  | 79.04%      | 97.65%
  paddleocr:        92.96%  | 52.16%      | 97.23%
  tesseract:        92.38%  | 42.56%      | 97.59%
  nougat:           92.37%  | 89.25%      | 92.77%
  easy_ocr:         89.91%  | 35.13%      | 95.62%
  keras_ocr:        89.7%   | 41.34%      | 94.71%
Overall is a weighted average of Handwritten and Typed. I also did Jaccard and Levenshtein distance, but the results were similar enough that I'm leaving them out for the sake of space.
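
For anyone who wants to reproduce this kind of comparison, here is a rough sketch of the three metrics in plain Python. It assumes whitespace tokenization and a bag-of-words vector for the cosine score; the comment above does not say how the texts were vectorized, so treat that as a guess, and the function names are only illustrative.

```python
import math
from collections import Counter

def cosine_similarity(truth: str, ocr: str) -> float:
    """Cosine similarity between word-count vectors (assumed representation)."""
    a, b = Counter(truth.lower().split()), Counter(ocr.lower().split())
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def jaccard_similarity(truth: str, ocr: str) -> float:
    """Shared distinct words divided by all distinct words."""
    a, b = set(truth.lower().split()), set(ocr.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def levenshtein(truth: str, ocr: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(ocr) + 1))
    for i, ch_t in enumerate(truth, 1):
        cur = [i]
        for j, ch_o in enumerate(ocr, 1):
            cur.append(min(prev[j] + 1,                   # delete from truth
                           cur[j - 1] + 1,                # insert into truth
                           prev[j - 1] + (ch_t != ch_o))) # substitute
        prev = cur
    return prev[-1]

if __name__ == "__main__":
    truth, ocr = "coconut milk", "coconut cnbc"
    print(cosine_similarity(truth, ocr), jaccard_similarity(truth, ocr), levenshtein(truth, ocr))
```

The word-level scores forgive character noise inside a word only if the whole word still matches, while Levenshtein counts every wrong character, so they can rank engines differently on messy handwriting.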

Overall, if you want the best: if you're an enterprise, just use whatever AWS/GCP/Azure you're on; if you're an individual, pick between those. While some of the open source solutions do quite well, surya took 188 seconds to process 88 pages on my RTX 3080, while the cloud ones took a few seconds to upload the docs and download the results. But if you do want open source, seriously consider surya, tesseract, and nougat depending on your needs. Surya is the best overall, while nougat was pretty good at handwriting. Tesseract is just blazingly fast: it took 121-200 seconds depending on whether you use tessdata-fast or tessdata-best, but that's CPU-based and it's trivially parallelizable, and on my 5950X using all the cores it took only 10 seconds to run through all 88 pages.
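
The "trivially parallelizable" part can be sketched roughly like this, assuming the tesseract binary is on your PATH and each page is already a separate image file. The helper names are made up for illustration, and this is not the exact setup used for the numbers above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

def tesseract_cli(image_path: str) -> str:
    """OCR one page by shelling out to tesseract (assumed to be installed)."""
    # Passing "stdout" as the output base makes tesseract print the text.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def ocr_pages(image_paths, ocr_fn=tesseract_cli, workers=None):
    """Run ocr_fn over all pages concurrently; results keep the input order."""
    # Threads are enough here because the heavy work happens in child processes.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(ocr_fn, image_paths))
```

Something like `ocr_pages(["page_001.png", "page_002.png"], workers=16)` would then keep one tesseract process per worker busy. Passing a different `ocr_fn` lets you swap in another engine or a stub for testing.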

But really, you need to generate some of your own sample test data/examples and run them through the models to see what's best. Given frankly how little this paper tested, I really should redo my study, add VLMs, and write a small blog/paper, been meaning to for years now.
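
A bare-bones way to run that kind of comparison on your own samples might look like the sketch below. The engine callables and sample list are placeholders you would fill in yourself, and the score uses difflib's SequenceMatcher ratio from the standard library, which is not one of the metrics reported in the table above.

```python
from difflib import SequenceMatcher
from statistics import mean

def rank_engines(engines, samples):
    """Rank OCR engines by mean similarity to your own ground truth.

    engines: dict of name -> callable(image_path) returning recognized text
    samples: list of (image_path, ground_truth_text) pairs
    """
    scores = {}
    for name, ocr_fn in engines.items():
        scores[name] = mean(
            SequenceMatcher(None, truth, ocr_fn(path)).ratio()
            for path, truth in samples
        )
    # Best-scoring engine first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

It helps to keep handwritten and typed samples in separate lists and rank them separately, since the table shows how much an overall number can hide.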

I've been looking for handwritten benchmarks for a while and would love to read that blog post.
For handwritten texts, the tool that works best for me is Qwen2.5-VL-72b [0]. It is also available online [1]. I'm surprised that it is not mentioned in the article since even the previous model (Qwen2-VL-72b) was better than the other VLMs I tried for OCR on handwritten texts.

[0]: https://huggingface.co/Qwen/Qwen2.5-VL-72B-Instruct

[1]: https://chat.qwenlm.ai

Not GP but it depends what you mean by accuracy. If you want inference like the 'coconut milk' described then obviously an LLM. If you want accurate as-written transcription, then I don't know the state of the art, but it'll be something purpose built for CV & handwriting recognition.

It'll also depend if you care about tabular data, whether a 'minor' numerical error (like 0 & 8 mismatched sometimes) is significantly worse than a 'typo' as it were in recognising a word, etc.

Accuracy should always work to be the answer you want, which is the most useful answer for applications. That is "coconut milk", not "coconut cnbc". Maybe "cnbc" should even be included, but definitely not replacing the word "milk" in that location.
Lots of factors to rank on, but generally speaking I don't find any of the open source options usable. They all either take a long time to tune or are just not accurate enough. Commercial services from one of the cloud players have hit the sweet spot for me.