HN Mirror

	by willj 803 days ago
	Relatedly, the OCR component relies on PyMuPDF, which has a license that requires releasing source code, which isn’t possible for most commercial applications. Is there any plan to move away from PyMuPDF, or is there a way to use an alternative?

kkielhofner 803 days ago

FWIW PyMuPDF doesn't do OCR. It extracts embedded text from a PDF, which in some cases is either non-existent or was produced by poor-quality OCR (like some random implementation from whatever it was scanned with).

This implementation bolts on Tesseract, which IME is typically not the best available option.
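
To make the distinction concrete, here is a rough sketch of the "use the embedded text, fall back to OCR only when it's missing" pattern. Everything in it is an assumption for illustration, not the project's actual code: `extract_with_fallback`, `ocr_page`, and the `min_chars` threshold are hypothetical names. In practice the embedded text would come from something like PyMuPDF's `page.get_text()`, and `ocr_page` would wrap whichever OCR engine you prefer.

```python
from typing import Callable, List


def extract_with_fallback(
    embedded_texts: List[str],
    ocr_page: Callable[[int], str],
    min_chars: int = 20,
) -> List[str]:
    """Return one text string per page.

    embedded_texts: text already pulled from each page's embedded text layer
                    (hypothetical input; e.g. what a PDF library returns).
    ocr_page:       callable that runs OCR on the page at the given index
                    (hypothetical; plug in any OCR engine here).
    min_chars:      assumed threshold below which the embedded text layer
                    is treated as missing.
    """
    results = []
    for index, text in enumerate(embedded_texts):
        if len(text.strip()) >= min_chars:
            # The embedded text layer looks usable, so no OCR is needed.
            results.append(text)
        else:
            # Empty or near-empty text layer: fall back to OCR for this page.
            results.append(ocr_page(index))
    return results


if __name__ == "__main__":
    pages = ["A page with a proper embedded text layer.", "  \n"]
    fake_ocr = lambda i: f"<OCR output for page {i}>"  # stand-in for a real engine
    print(extract_with_fallback(pages, fake_ocr))
```

Note that a length check only catches a missing text layer. It can't tell you when the embedded text exists but came from a bad OCR pass, which is the harder case.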

serjester 803 days ago

Author here. I’m very open to alternatives to PyMuPDF / Tesseract because I agree the OCR results are suboptimal and PyMuPDF has a restrictive license. I tried a few basic alternatives and found the results to be poor.

mcbetz 803 days ago

This article compares multiple solutions and recommends docTR (Apache License 2.0): https://source.opennews.org/articles/our-search-best-ocr-too...