


	by giamma 289 days ago
	How about using something like Apache Tika for extracting text from multiple documents? It started out as a subproject of Lucene and consists of a proxy parser plus delegates for a number of document formats. If a document, e.g. a PDF, comes from a scanner, Tika can optionally shell out to Tesseract and perform OCR for you.
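
To make the "proxy parser + delegates" idea concrete, here is a minimal sketch of the pattern in plain Python. It is not Tika's API: `ProxyParser`, `register`, `parse_plain_text`, and `parse_with_ocr` are hypothetical names made up for illustration, and the sketch picks a delegate by file suffix, whereas Tika itself detects the media type of the content. The OCR delegate assumes a `tesseract` binary on the PATH and uses its `stdout` output target.

```python
import shutil
import subprocess
from pathlib import Path
from typing import Callable, Dict

# Hypothetical delegate signature: take a file path, return its text.
Delegate = Callable[[Path], str]


def parse_plain_text(path: Path) -> str:
    """Delegate for plain-text files: just read them."""
    return path.read_text(encoding="utf-8", errors="replace")


def parse_with_ocr(path: Path) -> str:
    """Delegate for scanned images: shell out to the tesseract CLI.

    Assumes tesseract is installed; `stdout` makes it print the
    recognized text instead of writing an output file.
    """
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed")
    result = subprocess.run(
        ["tesseract", str(path), "stdout"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


class ProxyParser:
    """Single entry point that forwards each file to a format delegate."""

    def __init__(self) -> None:
        self._delegates: Dict[str, Delegate] = {}

    def register(self, suffix: str, delegate: Delegate) -> None:
        # Suffix matching is a simplification made for this sketch.
        self._delegates[suffix.lower()] = delegate

    def parse(self, path: Path) -> str:
        delegate = self._delegates.get(path.suffix.lower())
        if delegate is None:
            raise ValueError(f"no delegate registered for {path.suffix!r}")
        return delegate(path)
```

The caller only ever talks to the proxy, e.g. `proxy.register(".txt", parse_plain_text)` and `proxy.register(".png", parse_with_ocr)` followed by `proxy.parse(some_path)`, so supporting a new format means registering one more delegate.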

1 comment

Tika's documentation is abysmal. It may be a great product, but we had to scrap it because of this.