by chezmo 3626 days ago
For scanned images we use Tesseract OCR (https://github.com/tesseract-ocr/tesseract). For text-based PDFs we pull the text directly from the file, and all languages are supported.
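
A rough sketch of how that split could look, not the actual implementation: use the embedded text when the PDF has it, otherwise shell out to the Tesseract CLI. The function names, the "empty text means scanned" rule, and the `eng` default are assumptions for illustration; `embedded_text` stands in for whatever a PDF library returns, since the library is not named above.

```python
import subprocess


def tesseract_command(image_path, lang="eng"):
    """Build a Tesseract CLI call that prints recognized text to stdout."""
    # CLI form: tesseract <image> <outputbase> -l <lang>; "stdout" as the
    # output base makes Tesseract write the text to standard output.
    return ["tesseract", image_path, "stdout", "-l", lang]


def ocr_image(image_path, lang="eng"):
    """Run Tesseract on a scanned image (needs the tesseract binary on PATH)."""
    result = subprocess.run(
        tesseract_command(image_path, lang),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def extract_text(embedded_text, page_image_path, lang="eng"):
    """Prefer the PDF's embedded text; fall back to OCR when it is empty.

    Assumption: a page with no extractable text is treated as a scan.
    """
    if embedded_text and embedded_text.strip():
        return embedded_text
    return ocr_image(page_image_path, lang)
```

With this shape, text-based pages never touch OCR, so language only matters on the Tesseract path, where `lang` has to match an installed language pack.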