| HN Mirror


by breadislove 242 days ago

For everyone wondering how good this and other benchmarks are:

- the OmniAI benchmark is bad

- check out OmniDocBench[1] instead

- Mistral OCR is far, far behind most open-source OCR models and even further behind Gemini

- End to End OCR is still extremely tricky

- composed pipelines work better (layout detection -> reading order -> OCR every element)

- complex table parsing is still extremely difficult

[1]: https://github.com/opendatalab/OmniDocBench
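The composed pipeline described above (layout detection -> reading order -> OCR every element) can be sketched roughly as follows. This is a minimal illustration, not any particular tool's implementation: `Region`, `reading_order`, `run_pipeline`, and the two callables passed in are hypothetical names, and the reading-order step is a naive single-column heuristic standing in for whatever model a real pipeline would use.

```python
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


@dataclass
class Region:
    kind: str  # hypothetical label such as "title", "paragraph", "table"
    bbox: BBox


def reading_order(regions: List[Region], row_tolerance: int = 10) -> List[Region]:
    """Naive heuristic: top to bottom, then left to right.

    Regions whose top edges fall into the same band of `row_tolerance`
    pixels are treated as one row. Real documents (multi-column pages,
    vertical writing) need something much smarter than this.
    """
    return sorted(regions, key=lambda r: (r.bbox[1] // row_tolerance, r.bbox[0]))


def run_pipeline(
    page: Any,
    detect_layout: Callable[[Any], List[Region]],
    recognize: Callable[[Any, Region], str],
) -> List[Tuple[str, str]]:
    """Run the three stages and return (region kind, text) pairs in reading order."""
    regions = detect_layout(page)        # stage 1: layout detection
    ordered = reading_order(regions)     # stage 2: reading order
    return [(r.kind, recognize(page, r)) for r in ordered]  # stage 3: OCR per element


if __name__ == "__main__":
    # Stub stages so the sketch runs without any OCR engine installed:
    # the "page" is just a dict from bounding box to the text it contains.
    fake_page = {
        (55, 32, 100, 80): "right column",
        (0, 0, 100, 20): "Title",
        (0, 30, 45, 80): "left column",
    }
    detect = lambda page: [Region("paragraph", box) for box in page]
    recognize = lambda page, region: page[region.bbox]
    for kind, text in run_pipeline(fake_page, detect, recognize):
        print(kind, text)
```

Keeping each stage behind a plain callable means the layout detector or the recognizer can be swapped independently, which is the main practical appeal of a composed pipeline over an end-to-end model.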

2 comments

hakunin 242 days ago

Wish someone benchmarked Apple Vision Framework against these others. It's built into most Apple devices, but people don't know you can actually harness it to do fast, good quality OCR for you (and go a few extra steps to produce searchable pdfs, which is my typical use case). I'm very curious where it would fall in the benchmarks.

wahnfrieden 242 days ago

It is unusable trash for languages with any vertical writing such as Japanese. It simply doesn’t work.

thekid314 242 days ago

Yeah, and fails quickly at anything handwritten.

hakunin 242 days ago

I mostly OCR English, so Japanese (as mentioned by parent) wouldn't be an issue for me, but I do care about handwriting. See, these insights are super helpful. If only there was, say, a benchmark to show these.

My main question really is: what are practical OCR tools that I can string together on my MacBook Pro M1 Max w/ 64GB RAM to maximize OCR quality for lots of mail and schoolwork coming into my house, all mostly in English?

I use ScanSnap Manager with its built-in OCR tools, but that's probably super outdated by now. Apple Vision does a way better job than that. I've also heard people say that Apple Vision is better than Tesseract. But is there something better still that's also practical to run in a scripted environment on my machine?
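One way to compare candidate tools on your own scans is character error rate (CER), a common OCR metric: the edit distance between the tool's output and a hand-typed reference transcript, divided by the length of the reference. The sketch below is a plain-Python illustration; the function names are made up for this example, and it says nothing about how any of the benchmarks mentioned in this thread compute their scores.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, or substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(
                min(
                    prev[j] + 1,                      # delete char_a
                    curr[j - 1] + 1,                  # insert char_b
                    prev[j - 1] + substitution_cost,  # substitute or match
                )
            )
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length. Lower is better; values
    above 1.0 are possible when the OCR output is much longer."""
    if not reference:
        raise ValueError("reference transcript must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    truth = "Dear parents, the field trip is on Friday."
    ocr_output = "Dear parents, the fie1d trip is on Fridav."
    print(f"CER: {character_error_rate(truth, ocr_output):.3f}")
```

Transcribing a handful of representative pages by hand (some printed mail, some handwritten schoolwork) and scoring each tool's output against them gives a small personal benchmark for exactly the documents you care about.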

wahnfrieden 242 days ago

LiveText too? It has a newer engine

hakunin 242 days ago

This is your second comment about LiveText (this is the older one: https://news.ycombinator.com/item?id=43192141). I found that one by complete coincidence because I'm trying to provide a Ruby API for these frameworks. However, I can't find much info on LiveText. What framework is it part of? Do you have any links or any additional info? I found a source where they say it's specifically for screen and camera capturing.

wahnfrieden 242 days ago

https://developer.apple.com/documentation/visionkit/imageana... VisionKit. Swift-only (as with many new APIs) so lots of people stuck on ObjC bridges simply ignore it.

It does not provide bounding boxes but you can get text.

graeme 242 days ago

Interesting. How do you harness it for that purpose? I've found Apple's OCR to be very good.

hakunin 242 days ago

The short answer is a tool like OwlOCR (which also has CLI support). The long answer is that there are tools on GitHub (I created a stars list: https://github.com/stars/maxim/lists/apple-vision-framework/) that try to use the framework for various things. I’m also trying to build an ffi-based Ruby gem that provides convenient access in Ruby to the framework’s functionality.

ah27182 241 days ago

Apple Shortcuts lets you run OCR on images you pass into it. Look for the “Extract Text from Image” action.
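For scripting, macOS ships a `shortcuts` command-line tool that can run a saved shortcut with `shortcuts run`. A rough sketch of driving it from Python follows. It assumes you have already created a shortcut yourself (called "OCR Image" here, a made-up name) that takes an image as input, runs “Extract Text from Image”, and outputs the text; the helper function names are also hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_shortcut_command(shortcut_name: str, image: Path, output: Path) -> List[str]:
    """Build the argument list for running a saved shortcut on one image."""
    return [
        "shortcuts", "run", shortcut_name,
        "--input-path", str(image),
        "--output-path", str(output),
    ]


def ocr_with_shortcut(shortcut_name: str, image: Path, output: Path) -> str:
    """Run the shortcut and read back whatever it wrote to `output`.

    Assumes the shortcut's final output is plain text; macOS only.
    """
    subprocess.run(build_shortcut_command(shortcut_name, image, output), check=True)
    return output.read_text(encoding="utf-8")


if __name__ == "__main__":
    # "OCR Image" is a hypothetical shortcut name; use whatever you named yours.
    command = build_shortcut_command("OCR Image", Path("scan.png"), Path("scan.txt"))
    print(" ".join(command))
```

Passing the arguments as a list rather than a shell string avoids quoting problems with shortcut names or file paths that contain spaces.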

CaptainOfCoit 242 days ago

Yeah, if it was cross-platform maybe more people would be curious about it, but being able to run on only ~10% of the hardware people have doesn't make it very attractive to even begin spending time on Apple-exclusive stuff.

ch1234 242 days ago

But you can have an Apple device deployed in your stack to handle the OCR, right? I get on-device is a hardware limitation for many, but if you have an Apple device in your stack, can’t you leverage this?

CaptainOfCoit 242 days ago

Yeah, but handling macOS in an infrastructure capacity sucks. Apple really doesn't want you to, so tooling is almost nonexistent. I've set up CI/CD stacks before that needed macOS builders, and they're always the most cumbersome machines to manage as infrastructure.

coder543 242 days ago

AWS literally lets you deploy Macs as EC2 instances, which I believe includes all of AWS's usual EBS storage and disk imaging features.

CaptainOfCoit 242 days ago

Alright, so now the easy part is done. How do you actually manage them, keep them running, and do introspection without resorting to SSH or even remote desktop?

hakunin 242 days ago

10% of hardware is an insanely vast amount, no?

CaptainOfCoit 242 days ago

Well, it's a ninth of what everyone else uses, so even if the total number is big, the user base is relatively small.

hakunin 242 days ago

I don’t think 10% of anything would be considered relatively small even if we talk about 10 items: literally there’s only 10 items and this 1 has the rare quality of being among 10. Let alone billions of devices. Unless you want to reduce it to tautology, and instead of answering “why it’s not benchmarked” just go for “10 is smaller than 90, so I’m right”.

My point is, I don’t think any comparative benchmark would ever exclude something based on “oh it’s just 10%, who cares.” I think the issue is more that Apple Vision Framework is not well known as an OCR option, but maybe it’s starting to change.

And another part of the irony is that Apple’s framework probably gets way more real world usage in practice than most of the tools in that benchmark.

CaptainOfCoit 242 days ago

The initial wish was that more people cared about Apple Vision Framework. I'm merely claiming that since most people don't actually have Apple hardware, they're avoiding Apple technology as it commonly only runs on Apple hardware.

So I'm not saying it should be excluded because it can only be used by relatively few people, but I was trying to communicate that I kind of get why not so many people care about it and why it gets forgotten, since most people wouldn't be able to run it even if they wanted to.

By contrast, something like DeepSeek OCR could be deployed on any of the three major OSes (assuming there are implementations of the architecture available), so of course it gets a lot more attention and will be included in way more benchmarks.

cheema33 242 days ago

> the OmniAI benchmark is bad

According to the Omni OCR benchmark, Omni OCR is the best OCR. I am sure you all will find no issues with these findings.