by mlissner
1391 days ago
This uses AWS's Textract service, but if you're doing a LOT of extraction, that gets pretty expensive pretty quickly. We do thousands of pages daily on CourtListener.com and created an open-source microservice for this purpose. It can take PDF, DOCX, DOC, TXT, HTML, or a handful of other file types and extract the text, doing OCR if necessary: https://free.law/projects/doctor
We're always looking for more people to use and improve it.