Hacker News
by
chezmo
3624 days ago
We do position-based text extraction. However, we also apply an 'unpaper' step, which tries to correct misalignments and improves the quality of the scan.
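To illustrate what position-based extraction can look like, here is a minimal sketch. It assumes the words on a page are already available with bounding boxes; the `Word` class, `extract_region` function, and the coordinates are all hypothetical and not taken from the product described above. It also hints at why alignment correction matters: if the scan is skewed, the boxes drift out of the expected region.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A recognized word and its bounding box (hypothetical structure)."""
    text: str
    x: float  # left edge
    y: float  # top edge
    w: float  # width
    h: float  # height


def extract_region(words, left, top, right, bottom):
    """Return the text of all words whose centre lies inside the given box."""
    hits = [
        wd for wd in words
        if left <= wd.x + wd.w / 2 <= right
        and top <= wd.y + wd.h / 2 <= bottom
    ]
    # Naive reading order: top-to-bottom, then left-to-right.
    # Words on a slightly skewed line may sort incorrectly.
    hits.sort(key=lambda wd: (wd.y, wd.x))
    return " ".join(wd.text for wd in hits)


# Made-up sample data for an invoice-like page.
words = [
    Word("Invoice", 50, 40, 80, 12),
    Word("No.", 135, 40, 30, 12),
    Word("12345", 400, 40, 60, 12),
    Word("Total", 50, 700, 50, 12),
]
invoice_no = extract_region(words, 380, 30, 500, 60)
```

A real pipeline would probably group words into lines with some vertical tolerance instead of sorting on the exact `y` value.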
ComodoHacker
3624 days ago
What OCR library do you use? Which languages does it support?
chezmo
3624 days ago
For scanned images we use https://github.com/tesseract-ocr/tesseract. For text-based PDFs we pull the text directly from the file, and all languages are supported.
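The split between the two paths could be sketched roughly as follows. This is an assumption about how such a dispatch might be written, not the actual implementation: `extract_text` and `tesseract_ocr` are hypothetical names, and the OCR helper relies on the third-party `pytesseract` and Pillow packages plus an installed Tesseract binary.

```python
def extract_text(embedded_text, ocr):
    """Prefer the PDF's own text layer; fall back to OCR for scans.

    embedded_text: text pulled directly from the file (may be None or blank).
    ocr: zero-argument callable that runs OCR and returns a string.
    """
    if embedded_text and embedded_text.strip():
        return embedded_text
    return ocr()


def tesseract_ocr(image_path, lang="eng"):
    """Run Tesseract on one page image (assumes pytesseract and Pillow)."""
    # Imported lazily so the text-only path works without these packages.
    from PIL import Image
    import pytesseract

    return pytesseract.image_to_string(Image.open(image_path), lang=lang)
```

Usage would be something like `extract_text(page_text, lambda: tesseract_ocr("page1.png"))`, where the `lang` argument has to name a language pack that is installed for Tesseract.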