	by chezmo 3624 days ago
	We do position-based text extraction. However, we also apply an 'unpaper' function that tries to correct misalignments and improve the quality of the scan.
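For readers unfamiliar with the idea, here is a minimal sketch of what position-based extraction can look like. It is not the poster's implementation. The `Word` and `Box` types, the `extract_region` helper, and the `line_tolerance` parameter are all hypothetical names made up for illustration, and the sketch assumes each word already comes with page coordinates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    """A recognized word plus the position of its top-left corner (hypothetical type)."""
    text: str
    x: float
    y: float


@dataclass(frozen=True)
class Box:
    """A rectangular region of the page, in the same units as Word (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, word: Word) -> bool:
        return self.left <= word.x <= self.right and self.top <= word.y <= self.bottom


def extract_region(words, box, line_tolerance=2.0):
    """Return the text of every word anchored inside `box`, in reading order.

    Words whose y coordinate is within `line_tolerance` of the first word
    of a line are treated as belonging to that line.
    """
    inside = sorted((w for w in words if box.contains(w)), key=lambda w: w.y)
    lines = []  # each entry is the list of words on one text line
    for word in inside:
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return "\n".join(
        " ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines
    )


words = [
    Word("Invoice", 10, 10),
    Word("42.00", 60, 101),   # slightly lower than "Total:", as on a skewed scan
    Word("Total:", 10, 100),
    Word("Due", 10, 120),
]
print(extract_region(words, Box(0, 90, 200, 110)))  # prints "Total: 42.00"
```

The tolerance is what makes this sensitive to misaligned scans: if the page is skewed too far, words on the same line drift out of the expected box, which is presumably why a deskewing step helps.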

1 comment

What OCR library do you use? What languages does it support?

For scanned images we use Tesseract (https://github.com/tesseract-ocr/tesseract). For text-based PDFs we pull the text directly from the file, so all languages are supported there.
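A rough sketch of that two-path split, assuming the embedded text layer (if any) has already been read by some PDF library and that the `tesseract` command-line tool is installed. The `extract_text` and `tesseract_command` helpers and the `run` parameter are hypothetical names for illustration, not the poster's code.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract CLI call that writes recognized text to standard output."""
    return ["tesseract", image_path, "stdout", "-l", lang]


def extract_text(embedded_text, image_path, lang="eng", run=subprocess.run):
    """Prefer the PDF's own text layer and fall back to OCR on the page image.

    `embedded_text` is whatever text the PDF already contains for this page
    (an empty string for a pure scan). `run` is injectable so the OCR call
    can be replaced in tests.
    """
    if embedded_text.strip():
        # Text-based PDF: no OCR needed, so the language does not matter.
        return embedded_text
    # Scanned page: OCR quality depends on the language data passed via -l.
    result = run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Only the OCR branch depends on which language data Tesseract has installed. The direct path just returns whatever text the file already contains.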