Hacker News
by chezmo 3624 days ago
We do position-based text extraction. We also apply an 'unpaper' step, which tries to correct misalignments and improve the quality of the scan.
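
For readers wondering what position-based extraction can look like in practice, here is a minimal sketch. It is not the poster's implementation: the `Word` type, the `text_in_region` helper, and the coordinate convention (origin at top-left, y growing downward) are all assumptions made for illustration. The idea is simply to keep the words whose bounding-box center falls inside a fixed region of the page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """Hypothetical word record: text plus its bounding box on the page."""
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def text_in_region(words, region):
    """Return the text of all words whose center lies inside `region`.

    `region` is an (x0, y0, x1, y1) tuple in the same coordinates as the
    words. Words are ordered top-to-bottom, then left-to-right, using the
    top-left corner of each box. This assumes words on one line share the
    same y0; a real scan would likely need a tolerance when grouping lines.
    """
    rx0, ry0, rx1, ry1 = region
    hits = []
    for w in words:
        cx = (w.x0 + w.x1) / 2
        cy = (w.y0 + w.y1) / 2
        if rx0 <= cx <= rx1 and ry0 <= cy <= ry1:
            hits.append(w)
    hits.sort(key=lambda w: (w.y0, w.x0))
    return " ".join(w.text for w in hits)


# Illustrative data only: three words from a made-up invoice page.
words = [
    Word("42.00", 60, 100, 100, 112),
    Word("Invoice", 10, 10, 70, 22),
    Word("Total", 10, 100, 50, 112),
]
total_line = text_in_region(words, (0, 90, 200, 120))
```

A fixed region like this only works when the page is well aligned, which is presumably why a deskewing step such as unpaper helps before extraction.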

What OCR library do you use? What languages does it support?
For scanned images we use https://github.com/tesseract-ocr/tesseract. For text-based PDFs we pull the text directly from the file, and all languages are supported.
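
The two paths described above could be wired together roughly as follows. This is a sketch under stated assumptions, not the poster's code: `extract_text` and `tesseract_command` are hypothetical helper names, the caller is assumed to have already tried to read any embedded text from the PDF, and the OCR path assumes the `tesseract` command-line tool is installed with the requested language data.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def run_tesseract(image_path, lang="eng"):
    """Run Tesseract on an image and return the recognized text.

    Assumes the `tesseract` binary is on PATH; raises CalledProcessError
    if the command fails.
    """
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_text(embedded_text, image_path, run_ocr=run_tesseract, lang="eng"):
    """Prefer text embedded in the file; fall back to OCR for scans.

    `embedded_text` is whatever the caller pulled out of the PDF (empty or
    None for a pure scan). `run_ocr` is injectable so the OCR step can be
    swapped or stubbed out.
    """
    if embedded_text and embedded_text.strip():
        # Text-based PDF: no OCR involved, so no OCR language limits apply.
        return embedded_text
    return run_ocr(image_path, lang)
```

On the OCR path, language support depends on which Tesseract language packs are installed; the embedded-text path has no such dependency.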