by mindcrime
4533 days ago
I won't swear to it, but I suspect you're going to have to largely roll your own, and that it will be at least partly heuristic-driven. I use Apache Tika[1] to extract text from PDFs and then index it with Lucene, but we don't need to discriminate between chapters or anything like that. Still, I can picture how you could use OpenNLP[2] and some custom code to break the text down into chapters.

[1]: http://tika.apache.org

[2]: http://opennlp.apache.org
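To make the "heuristic" part concrete, here is a rough sketch of the kind of custom code I have in mind. It is in Python rather than Java just to keep it short, and it assumes the text has already been extracted from the PDF (by Tika or anything else). The heading pattern, `CHAPTER_HEADING`, and `split_chapters` are names and assumptions I made up for illustration; real books will need a pattern tuned to their actual layout, and this does not use OpenNLP at all.

```python
import re

# Assumed heading style: a line that starts with "Chapter" followed by an
# Arabic or Roman numeral, e.g. "Chapter 3" or "CHAPTER IV: Title".
# This is a guess at a common layout, not a rule that holds for every book.
CHAPTER_HEADING = re.compile(
    r"^[ \t]*chapter[ \t]+(\d+|[ivxlcdm]+)\b[^\n]*$",
    re.IGNORECASE | re.MULTILINE,
)


def split_chapters(text):
    """Return (heading, body) pairs for each chapter heading found in text.

    Anything before the first heading (front matter) is ignored.
    """
    matches = list(CHAPTER_HEADING.finditer(text))
    chapters = []
    for i, match in enumerate(matches):
        # A chapter body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        heading = match.group(0).strip()
        body = text[match.end():end].strip()
        chapters.append((heading, body))
    return chapters


if __name__ == "__main__":
    sample = (
        "Preface text\n"
        "Chapter 1: Intro\n"
        "Hello world.\n"
        "\n"
        "CHAPTER II\n"
        "Second body.\n"
    )
    for heading, body in split_chapters(sample):
        print(heading, "->", len(body), "characters")
```

Expect false positives (a paragraph line that happens to begin with "Chapter 3 discusses ...") and misses (books that use bare numbers or only titles as headings), which is why I'd call this heuristic rather than a solved problem.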