HN Mirror



	by rudolph9 539 days ago
	We parse millions of PDFs using Apache Tika and process about 30,000 per dollar of compute cost. However, the structured output leaves something to be desired, and there are a significant number of pages that Tika is unable to parse. https://tika.apache.org/
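For a rough sense of what that throughput means in dollars, here is a back-of-the-envelope sketch. The only figure taken from the post is 30,000 PDFs per dollar; the function names and the 3,000,000-document batch are made up for illustration, and real costs will depend on document size and how many pages need OCR.

```python
# Rough cost estimate based on the ~30,000 PDFs per dollar figure quoted above.
# Function names and the example batch size are hypothetical, not from the post.

def compute_cost_usd(num_pdfs: int, pdfs_per_dollar: float = 30_000) -> float:
    """Estimated compute cost in USD for parsing num_pdfs documents."""
    return num_pdfs / pdfs_per_dollar


def cost_per_pdf_usd(pdfs_per_dollar: float = 30_000) -> float:
    """Estimated compute cost in USD for a single document."""
    return 1 / pdfs_per_dollar


if __name__ == "__main__":
    batch = 3_000_000  # hypothetical batch size
    print(f"{batch:,} PDFs: about ${compute_cost_usd(batch):,.2f}")
    print(f"per PDF: about ${cost_per_pdf_usd():.6f}")
```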

1 comment

rudolph9 539 days ago

Under the hood, Tika uses Tesseract for OCR. To be clear, this all works surprisingly well in general, it's pretty easy to run yourself, and it's an order of magnitude cheaper than most services out there.

https://tesseract-ocr.github.io/tessdoc/
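If you want to try the self-hosted route, a minimal sketch of calling a local Tika server from Python could look like the following. This is not our production setup. It assumes a tika-server instance listening on its default port 9998 with Tesseract installed on the same machine, and it uses the `X-Tika-PDFOcrStrategy` request header as I understand it from the tika-server docs, so check the accepted values against your Tika version. The function names are made up for illustration.

```python
import urllib.request

# Assumption: a local tika-server on its default port.
TIKA_URL = "http://localhost:9998/tika"


def build_tika_request(pdf_bytes: bytes,
                       ocr_strategy: str = "ocr_only",
                       url: str = TIKA_URL) -> urllib.request.Request:
    """Build a PUT request asking Tika for the plain text of a PDF.

    ocr_strategy is passed through in the X-Tika-PDFOcrStrategy header
    (assumed values include "no_ocr" and "ocr_only"; verify for your version).
    """
    headers = {
        "Content-Type": "application/pdf",
        "Accept": "text/plain",
        "X-Tika-PDFOcrStrategy": ocr_strategy,
    }
    return urllib.request.Request(url, data=pdf_bytes, headers=headers, method="PUT")


def extract_text(pdf_path: str, ocr_strategy: str = "ocr_only") -> str:
    """Send one PDF to the Tika server and return the extracted text."""
    with open(pdf_path, "rb") as f:
        req = build_tika_request(f.read(), ocr_strategy)
    # OCR can be slow on scanned documents, so allow a generous timeout.
    with urllib.request.urlopen(req, timeout=300) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Calling `extract_text("scan.pdf")` (a hypothetical file) returns whatever text Tika and Tesseract recover. Splitting the request building from the network call keeps the header logic easy to check without a running server.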
