Hacker News new | ask | show | jobs
by willj 803 days ago
Relatedly, the OCR component relies on PyMuPDF, which has a license that requires releasing source code, which isn’t possible for most commercial applications. Is there any plan to move away from PyMuPDF, or is there a way to use an alternative?
1 comments

FWIW PyMuPDF doesn't do OCR. It extracts embedded text from a PDF, which in some cases is either non-existent or done with poor quality OCR (like some random implementation from whatever it was scanned with).

This implementation bolts on Tesseract which IME is typically not the best available.

Author here. I’m very open to alternatives to PyMuPDF / tesseract because I agree OCR results are sub optimal and it has a restrictive license. I tried basic ones and found the results to be poor.
This article compares multiple solutions and recommends docTR (Apache License 2.0): https://source.opennews.org/articles/our-search-best-ocr-too...