by lou1306 691 days ago
If the PDFs are textual or already have an OCR text layer, then pdftotext from the Poppler suite ought to be enough? If not, add Tesseract/OCRmyPDF to the pipeline?
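
Something like this is what I have in mind: a rough Python sketch, not a tested pipeline. It assumes the `pdftotext` and `ocrmypdf` binaries are on your PATH; the function names and the `min_chars` threshold are made up for illustration, and the "is there a text layer?" check is just a guess based on how much non-whitespace text comes back.

```python
import subprocess
import tempfile
from pathlib import Path


def pdftotext_cmd(pdf: str) -> list[str]:
    # "-" as the output file tells pdftotext to write to stdout.
    return ["pdftotext", "-layout", pdf, "-"]


def ocrmypdf_cmd(src: str, dst: str) -> list[str]:
    # --skip-text leaves pages that already have a text layer untouched.
    return ["ocrmypdf", "--skip-text", src, dst]


def needs_ocr(text: str, min_chars: int = 20) -> bool:
    # Hypothetical heuristic: almost no non-whitespace characters
    # probably means the PDF has no usable text layer.
    return len("".join(text.split())) < min_chars


def run(cmd: list[str]) -> str:
    # Raises CalledProcessError if the tool exits with a non-zero status.
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout


def extract_text(pdf: str) -> str:
    # First try plain extraction; fall back to OCR only when it looks empty.
    text = run(pdftotext_cmd(pdf))
    if not needs_ocr(text):
        return text
    with tempfile.TemporaryDirectory() as tmp:
        ocr_pdf = str(Path(tmp) / "ocr.pdf")
        run(ocrmypdf_cmd(pdf, ocr_pdf))
        return run(pdftotext_cmd(ocr_pdf))
```

Calling `extract_text("scan.pdf")` would then only pay the OCR cost for files that come back empty. OCRmyPDF uses Tesseract under the hood, so you still need that installed.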