by jahewson
878 days ago
Tika uses PDFBox under the hood, relying on its built-in text extractor (which is "ok"). If you're looking for table extraction specifically, check out Tabula (https://tabula.technology), which is also built on top of PDFBox and has some contributions from the same maintainers. PDFBox actually exposes a text extraction API (I wrote it!) that sits at a lower level than the one Tabula uses, so you can roll your own extractor, but that's where dragons live, trust me :)
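To give a rough idea of what "rolling your own extractor" involves, here is a minimal sketch in plain Java. It does not call PDFBox at all: `Glyph`, `GlyphLines`, `toText`, and the two tolerance parameters are hypothetical names made up for illustration, standing in for the positioned characters a low-level API would hand you. It assumes y grows downward and that text is horizontal and left-to-right, which real PDFs frequently violate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GlyphLines {
    // Hypothetical stand-in for one positioned character from a low-level PDF API.
    public static final class Glyph {
        final String text;
        final double x, y, width;

        public Glyph(String text, double x, double y, double width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    // Groups glyphs into lines by y (within lineTolerance of the line's first glyph),
    // orders each line by x, and inserts a space when the horizontal gap exceeds spaceGap.
    public static String toText(List<Glyph> glyphs, double lineTolerance, double spaceGap) {
        List<Glyph> sorted = new ArrayList<>(glyphs);
        sorted.sort(Comparator.comparingDouble((Glyph g) -> g.y)); // assumes y grows downward

        StringBuilder out = new StringBuilder();
        List<Glyph> line = new ArrayList<>();
        double lineY = 0;
        for (Glyph g : sorted) {
            if (!line.isEmpty() && Math.abs(g.y - lineY) > lineTolerance) {
                appendLine(out, line, spaceGap);
                line.clear();
            }
            if (line.isEmpty()) {
                lineY = g.y; // the first glyph of a line fixes its baseline
            }
            line.add(g);
        }
        if (!line.isEmpty()) {
            appendLine(out, line, spaceGap);
        }
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Glyph> line, double spaceGap) {
        line.sort(Comparator.comparingDouble((Glyph g) -> g.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        Glyph prev = null;
        for (Glyph g : line) {
            // PDFs often contain no space characters, so gaps have to be inferred.
            if (prev != null && g.x - (prev.x + prev.width) > spaceGap) {
                out.append(' ');
            }
            out.append(g.text);
            prev = g;
        }
    }

    public static void main(String[] args) {
        List<Glyph> glyphs = List.of(
                new Glyph("i", 15, 100.5, 5),
                new Glyph("H", 10, 100, 5),
                new Glyph("y", 10, 120, 5),
                new Glyph("o", 22, 120, 5));
        System.out.println(toText(glyphs, 2.0, 1.0));
    }
}
```

Even this toy version has to guess at line breaks and word spacing from coordinates alone. Rotated text, multiple columns, overlapping glyphs, and fonts with odd width metrics all break those guesses, which is roughly where the dragons come in.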