by pc86 4012 days ago
How does the PDF OCR process compare to images? I uploaded a sample PDF with very clear sans-serif text (printed to PDF from a webpage) and there seem to be some odd substitutions: "prohibitecL" instead of "prohibited", "ac" instead of "QC" (as part of an address), random clipping of the first letter in a few lines, and random use of a capital I instead of 1.

Overall very good. I'm just wondering if the library is better with image files than PDFs?

1 comment

Interesting... I see it now. I assume it's some issue in the PDF-to-image conversion in the web app. PDF support is just a few days old.

The OCR library itself supports only image formats as input and is "innocent" with regard to this issue ;)