
by mlissner 1438 days ago

This uses AWS's Textract service, but if you're doing a LOT of extraction, that gets expensive quickly. We process thousands of pages daily on CourtListener.com and created an open source microservice for this purpose. It takes PDF, DOCX, DOC, TXT, HTML, and a handful of other file types and extracts the text, doing OCR if necessary:

https://free.law/projects/doctor

We're always looking for more people to use and improve it.
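
For anyone wondering what calling a service like this looks like, here is a rough sketch in Python. It only builds a multipart file upload; the host, port, endpoint path, and the form field name `file` are assumptions for illustration, not confirmed details of Doctor's API, so check the project docs before relying on them.

```python
import urllib.request
import uuid

# Assumption: an extraction service is running locally. The host, port, and
# path below are placeholders, not confirmed details of Doctor's API.
SERVICE_URL = "http://localhost:5050/extract/doc/text/"


def build_extract_request(filename: str, data: bytes,
                          url: str = SERVICE_URL) -> urllib.request.Request:
    """Build (but do not send) a multipart/form-data POST carrying one file."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Assumption: the service reads the upload from a form field named "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        data,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {"Content-Type": f"multipart/form-data; boundary={boundary}"}
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def send(request: urllib.request.Request) -> str:
    """Send the request and return the response body as text.

    Needs a running service; the response format is not assumed here.
    """
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8", errors="replace")
```

With a service actually running, usage would be something like `send(build_extract_request("brief.pdf", open("brief.pdf", "rb").read()))`. Building the request separately from sending it makes it easy to inspect the payload without touching the network.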