by rudolph9
491 days ago
We parse millions of PDFs using Apache Tika, at about 30,000 PDFs per dollar of compute. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse. https://tika.apache.org/
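To put the throughput figure in perspective, here is a rough back-of-the-envelope sketch. The only input taken from the comment is the rate of roughly 30,000 PDFs per dollar; the function name `cost_usd` and the example batch sizes are made up for illustration, and the rate is assumed to stay constant at any volume.

```python
# Rough cost estimate from the quoted rate of ~30,000 PDFs per dollar.
# Assumption: the rate is constant regardless of batch size.
PDFS_PER_DOLLAR = 30_000


def cost_usd(num_pdfs: int, pdfs_per_dollar: int = PDFS_PER_DOLLAR) -> float:
    """Estimated compute cost in dollars for parsing num_pdfs documents."""
    return num_pdfs / pdfs_per_dollar


if __name__ == "__main__":
    # Hypothetical batch sizes, chosen only to show the scale.
    for n in (1, 1_000_000, 10_000_000):
        print(f"{n:>12,} PDFs: about ${cost_usd(n):,.5f}")
```

At that rate a million PDFs comes to a little over $33 of compute, so any fallback path for pages Tika cannot parse would likely dominate the bill if it were much more expensive per page.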
https://tesseract-ocr.github.io/tessdoc/