

by vikp 807 days ago

This looks great! You might be interested in surya - https://github.com/VikParuchuri/surya (I'm the author). It does OCR (much more accurate than tesseract), layout analysis, and text detection.

The OCR is slow on CPU (working on it), but on GPU it is faster than tesseract (which is CPU-only).

You could probably replace pymupdf, tesseract, and some layout heuristics with this.

Happy to discuss more, feel free to email me (in profile).

2 comments

nicklo 807 days ago

OP: please don't poison your MIT license w/ surya's GPL license


vikp 804 days ago

It should be possible to call a GPL library in a separate process (surya can batch process from the CLI) and avoid GPL - ocrmypdf does this with ghostscript.
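A minimal sketch of the separate-process pattern described above, in Python. The helper name `run_in_separate_process` is hypothetical, and the command in the example is a stand-in; the actual surya command line is not shown here and should be taken from surya's own documentation. This only illustrates the mechanics of process isolation and makes no claim about licensing.

```python
import subprocess
import sys
from typing import Sequence


def run_in_separate_process(command: Sequence[str], timeout: float = 600.0) -> str:
    """Run an external CLI tool as a child process and return its stdout.

    The tool is never imported into this program; the only coupling is
    the argument list, the exit code, and the output streams.
    (Hypothetical helper, not part of surya or ocrmypdf.)
    """
    completed = subprocess.run(
        list(command),
        capture_output=True,  # collect stdout/stderr instead of inheriting them
        text=True,            # decode output as str rather than bytes
        timeout=timeout,      # raises TimeoutExpired if the tool hangs
        check=True,           # raises CalledProcessError on a non-zero exit code
    )
    return completed.stdout


if __name__ == "__main__":
    # Stand-in command so the sketch runs anywhere; replace it with the
    # real OCR command line (an assumption left to the reader).
    placeholder = [sys.executable, "-c", "print('ocr output placeholder')"]
    print(run_in_separate_process(placeholder))
```

For batch jobs it is usually simpler to have the tool write results to a directory and read those files afterwards than to parse stdout.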


barfbagginus 806 days ago

Can I send a PR extending the benchmark against doctr and potentially textract? I believe these represent the SOTA for open and proprietary OCR.

The benefit is to let people evaluate surya against the open source and commercial SOTA, improving the integrity and applicability of the benchmark in a business or research setting.

There's a risk: it could make surya's benchmark look less attractive. Also, picking textract to represent the proprietary SOTA might be dicey, since it has competitors (Google cloud ocr, Azure ocr).

Still, ranking surya with doctr, textract, and tesseract would be a really nice baseline. As a research user, business user or open source contributor, those are the results I need to quickly understand surya's potential.


vikp 804 days ago

I've benchmarked against google cloud ocr, but the results are on Twitter, not the repo yet - https://twitter.com/VikParuchuri/status/1765440195124691339 . The reason I didn't benchmark against doctr is language support.
