	by alaanor 125 days ago
	There were so many OCR models released in the past few months, all VLMs, and yet none of them handle Korean well. Every time I try with a random screenshot (not an A4 document) they just fail at a "simple" task. And funnily enough Qwen3 8B VL is the best model, the one that usually gets it right (although I couldn't get the bboxes quite right). Even funnier, whatever is running on an iPhone locally on CPU is insanely good, same with Google's OCR API. I don't know why we don't get more of the traditional OCR stuff. PaddlePaddle v5 is the closest I could find. At this point, I feel like I might be doing something wrong with those VLMs.

3 comments

Stagnant 125 days ago

Chrome ships a local OCR model for text extraction from PDFs which is better than any of the VLMs or open source OCR models I've tried. I had a few hundred gigs of old newspaper scans and after trying all the other options I ended up building a wrapper around the DLL it uses to get the text and bboxes. Performance and accuracy were on another level compared to Tesseract, and while VLMs sometimes produced good results they just seemed unreliable.

I've thought of open sourcing the wrapper but haven't gotten around to it yet. I bet Claude Code can build a functioning prototype if you just point it to the "screen_ai" dir under Chrome's user data.

alvibo 125 days ago

Is there a chance you'll open source the wrapper after all? It would help a lot of people like me. No pressure though, but now I really want to try it to OCR a bunch of Japanese scans I have lying around. Unfortunately, finding a good OCR for Japanese scans is still a huge problem in 2026.

zzleeper 125 days ago

Surprisingly, I also have a few hundred gigs of old newspaper scans, so I'm very curious.

How fast was it per page? Do you recall if it's CPU or GPU based? TY!

Stagnant 125 days ago

It is CPU-based. Somewhere between 1 and 2 seconds per page on a single core. I ran 20 instances of it in parallel to utilize 20 CPU cores, so the average time per page came down nicely.
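
A rough sketch of that kind of fan-out, not the actual wrapper (which isn't public). `ocr_shard` is a hypothetical placeholder for whatever calls into the OCR library, and the round-robin split is just one reasonable way to hand each instance its own pile of pages:

```python
from concurrent.futures import ProcessPoolExecutor


def shard(items: list[str], n: int) -> list[list[str]]:
    """Split items into n round-robin shards of near-equal size."""
    return [items[i::n] for i in range(n)]


def ocr_shard(paths: list[str]) -> list[tuple[str, str]]:
    """Hypothetical worker: one OCR instance processing its shard serially."""
    # Placeholder for the real call into the OCR wrapper (an assumption here).
    return [(p, f"<text of {p}>") for p in paths]


def ocr_all(paths: list[str], instances: int = 20) -> dict[str, str]:
    """Fan the shards out to separate processes, one per CPU core."""
    results: dict[str, str] = {}
    with ProcessPoolExecutor(max_workers=instances) as pool:
        # map() yields one list of (path, text) pairs per shard
        for pairs in pool.map(ocr_shard, shard(paths, instances)):
            results.update(pairs)
    return results
```

Call `ocr_all(pages, instances=20)` from under an `if __name__ == "__main__":` guard so worker processes can start cleanly on every platform. Separate processes rather than threads mirrors the "20 instances" setup and keeps each instance isolated.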

zzleeper 120 days ago

That's actually amazing, and might give me a way to use all the cores I have lying around. 2s per page is an insane 600 pages per minute at 20 cores!

Please do open source it, even if you don't do much around it (worst case I can just spend a few million tokens trying to get Opus 4.6 to make it work).
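
For reference, the 600 pages per minute figure works out as below, assuming throughput scales perfectly with core count (in practice disk I/O and memory bandwidth may eat into it). The function name is made up for illustration:

```python
def pages_per_minute(seconds_per_page: float, cores: int) -> float:
    """Aggregate throughput, assuming perfect linear scaling across cores."""
    return 60.0 / seconds_per_page * cores


print(pages_per_minute(2.0, 20))  # 600.0, the slow end of the 1-2 s range
print(pages_per_minute(1.0, 20))  # 1200.0, the fast end
```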

mwcampbell 125 days ago

What's the name of this DLL? I assume it's separate from the monster chrome.dll, and that the model is proprietary.

Stagnant 125 days ago

chrome_screen_ai.dll is the name of the DLL (libchromescreenai.so on Linux) and yes, it is proprietary. It isn't included by default; Chrome uses its component service to download it automatically when you open a PDF file that doesn't have pre-existing OCR'd text on it. You can download it separately from here: https://chrome-infra-packages.appspot.com/p/chromium/third_p...
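
A small sketch for checking whether the component has already been downloaded on a given machine. The file names and the "screen_ai" directory come from this thread; the location of Chrome's user data directory is left as a parameter, and the layout below "screen_ai" is an assumption, hence the recursive search:

```python
import sys
from pathlib import Path
from typing import Optional

# File names as given in the thread; other platforms are not covered there.
LIBRARY_NAMES = {"win32": "chrome_screen_ai.dll", "linux": "libchromescreenai.so"}


def find_screen_ai_library(
    user_data_dir: Path, platform: str = sys.platform
) -> Optional[Path]:
    """Return the newest matching library under <user_data_dir>/screen_ai, or None."""
    name = LIBRARY_NAMES.get(platform)
    component_dir = user_data_dir / "screen_ai"
    if name is None or not component_dir.is_dir():
        return None
    # Subdirectory layout is assumed unknown, so search recursively and
    # prefer the most recently modified copy if several versions exist.
    matches = sorted(component_dir.rglob(name), key=lambda p: p.stat().st_mtime)
    return matches[-1] if matches else None
```

If this returns `None`, opening a scanned PDF without a text layer in Chrome should trigger the download, per the comment above.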

ghrl 125 days ago

I remember someone building a meme search engine for millions of images using a cluster of used iPhone SEs because of Apple's very good and fast OCR capabilities. Quite an interesting read as well: https://news.ycombinator.com/item?id=34315782

fzysingularity 125 days ago

Apple's OCR, even on the Mac, is insanely good, in fact way better than AWS Textract or GCP Cloud Vision OCR.

Any idea what model is being used?

AlphaSite 125 days ago

Probably some custom model built for their hardware.

deaux 124 days ago

Gemini, even Flash, crushes almost any major script, including CJK. Not self-hostable though.
