by mlissner 1391 days ago
This uses AWS's Textract service, but if you're doing a LOT of extraction, that gets expensive pretty quickly. We process thousands of pages a day on CourtListener.com, so we built an open source microservice for this. It takes PDF, DOCX, DOC, TXT, HTML, and a handful of other file types and extracts the text, running OCR if necessary:

https://free.law/projects/doctor

We're always looking for more people to use and improve it.
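
For anyone who wants to try it, here is a rough sketch of how a client might hand a file to a locally running instance over HTTP. The host, port, endpoint path, and form field name are assumptions made for illustration and are not stated in this comment, so check the project's README for the actual API. The helper `build_extract_request` is hypothetical; it only builds the request and does not send it.

```python
import urllib.request
import uuid

# ASSUMPTION: a Doctor instance is listening locally. The port and the
# endpoint path are illustrative guesses; confirm them in the project's docs.
DOCTOR_URL = "http://localhost:5050/extract/doc/text/"


def build_extract_request(
    filename: str, data: bytes, url: str = DOCTOR_URL
) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one file.

    The form field name "file" is an assumption, not a documented guarantee.
    """
    boundary = uuid.uuid4().hex
    body = b"".join(
        [
            f"--{boundary}\r\n".encode(),
            (
                'Content-Disposition: form-data; name="file"; '
                f'filename="{filename}"\r\n'
            ).encode(),
            b"Content-Type: application/octet-stream\r\n\r\n",
            data,
            f"\r\n--{boundary}--\r\n".encode(),
        ]
    )
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


# Sending the request needs a running service, so it is left commented out:
# with open("brief.pdf", "rb") as fh:
#     req = build_extract_request("brief.pdf", fh.read())
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode())
```

Running your own service this way swaps a per-page API bill for the cost of hosting a container, which is the tradeoff that matters at thousands of pages a day.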