by mindcrime 4533 days ago

I won't swear to it, but I suspect you're going to have to largely roll your own, and that it will be at least partly heuristic-driven. I use Apache Tika[1] to extract text from PDFs and then index it with Lucene, but we don't need to distinguish between chapters or anything like that. Still, I can picture how you could use OpenNLP[2] plus some custom code to break the text down into chapters.

[1]: http://tika.apache.org

[2]: http://opennlp.apache.org
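
To give a rough idea of what the "custom code" part might look like, here is a minimal sketch of a heuristic chapter splitter. It assumes the text has already been extracted from the PDF (by Tika or anything else) into a plain string, and it assumes chapters start with a line like "Chapter 3" or "CHAPTER IV: Title". The `CHAPTER_HEADING` pattern and the `split_chapters` name are made up for illustration, not part of Tika or OpenNLP, and real books will almost certainly need the pattern tuned.

```python
import re

# Hypothetical heading heuristic: a line that begins with "chapter" followed
# by an arabic or roman numeral. This is an assumption about the book's
# layout, not something any library guarantees.
CHAPTER_HEADING = re.compile(
    r"^[ \t]*chapter[ \t]+(\d+|[ivxlcdm]+)\b.*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_chapters(text):
    """Split extracted plain text into a list of (heading, body) pairs.

    Any text before the first heading is returned under "front matter".
    """
    matches = list(CHAPTER_HEADING.finditer(text))
    if not matches:
        stripped = text.strip()
        return [("front matter", stripped)] if stripped else []

    chapters = []
    preamble = text[: matches[0].start()].strip()
    if preamble:
        chapters.append(("front matter", preamble))

    for i, match in enumerate(matches):
        # A chapter body runs from the end of its heading line to the start
        # of the next heading (or to the end of the text for the last one).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        body = text[match.end():end].strip()
        chapters.append((match.group(0).strip(), body))
    return chapters
```

Each (heading, body) pair could then be indexed as its own document. A pattern like this will misfire on books that use unnumbered or oddly formatted headings, which is where something smarter than a regex would probably have to come in.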