HN Mirror



	by kkielhofner 807 days ago
	FWIW PyMuPDF doesn't do OCR. It extracts the text embedded in a PDF, which in some cases is either non-existent or was produced by poor-quality OCR (like some random implementation from whatever device it was scanned with). This implementation bolts on Tesseract, which IME is typically not the best available.
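To make the distinction above concrete, here is a minimal sketch of the "use the embedded text if it looks sane, otherwise OCR the page" decision. It is an illustration only, not the linked project's code: `looks_usable`, `page_text`, and both thresholds are hypothetical, and the OCR step is passed in as a plain callable so the snippet does not depend on any particular engine. With PyMuPDF the `embedded` string would presumably come from `page.get_text()`.

```python
from typing import Callable, Tuple


def looks_usable(text: str, min_chars: int = 20, min_alnum_ratio: float = 0.6) -> bool:
    """Crude heuristic (assumed thresholds): enough characters, mostly letters/digits/whitespace."""
    stripped = text.strip()
    if len(stripped) < min_chars:
        # Empty or near-empty text layer, e.g. a scanned page with no embedded text.
        return False
    good = sum(ch.isalnum() or ch.isspace() for ch in stripped)
    return good / len(stripped) >= min_alnum_ratio


def page_text(embedded: str, run_ocr: Callable[[], str]) -> Tuple[str, str]:
    """Return (source, text): keep the embedded layer if it looks usable, else call OCR."""
    if looks_usable(embedded):
        return "embedded", embedded
    # run_ocr is a hypothetical hook; it is only invoked when the embedded text fails the check.
    return "ocr", run_ocr()
```

Passing OCR as a callable keeps the expensive step lazy and makes it easy to swap Tesseract for another engine without touching the decision logic.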


serjester 807 days ago

Author here. I’m very open to alternatives to PyMuPDF / Tesseract because I agree the OCR results are suboptimal and PyMuPDF has a restrictive license. I tried some basic ones and found the results to be poor.


mcbetz 807 days ago

This article compares multiple solutions and recommends docTR (Apache License 2.0): https://source.opennews.org/articles/our-search-best-ocr-too...
