by dmezzetti 884 days ago
Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.

Here are a couple of examples:

- https://neuml.hashnode.dev/build-rag-pipelines-with-txtai

- https://neuml.hashnode.dev/extract-text-from-documents

Disclaimer: I'm the primary author of txtai (https://github.com/neuml/txtai).
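
For anyone wondering what "chunking on the XHTML" can look like, here's a rough sketch using only the Python standard library. It assumes the XHTML wraps paragraphs in `<p>` tags; the `ParagraphCollector` and `chunk_xhtml` names, the `max_chars` limit and the sample markup are all made up for illustration, and this is not how txtai does it internally.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the text of each <p> element in an XHTML string."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._buffer = None  # None while we are outside a <p>

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._buffer = []

    def handle_data(self, data):
        if self._buffer is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "p" and self._buffer is not None:
            # Collapse internal whitespace and skip empty paragraphs
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self._buffer = None


def chunk_xhtml(xhtml, max_chars=500):
    """Group consecutive paragraphs into chunks of roughly max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    parser = ParagraphCollector()
    parser.feed(xhtml)
    parser.close()

    chunks, current = [], ""
    for para in parser.paragraphs:
        candidate = f"{current}\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample markup, standing in for real parser output
SAMPLE = """<html><body>
<p>First paragraph about parsing.</p>
<p>Second paragraph about chunking.</p>
<p>Third paragraph about retrieval.</p>
</body></html>"""

chunks = chunk_xhtml(SAMPLE, max_chars=70)
for chunk in chunks:
    print(repr(chunk))
```

Splitting on paragraph boundaries rather than fixed character offsets keeps each chunk self-contained, which is the main reason structured output is handy here.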


Off-topic, but do you know how Tika compares to other PDF parsing libraries? I was very unimpressed by pdfminer.six (what unstructured uses), as the layout detection seems pretty basic. It fails to parse multi-column text, whereas MuPDF does it perfectly.

Currently I'm using a mix of MuPDF + AWS Textract (for tables, mostly), but I'd love to understand what other people are doing.
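
To make the multi-column problem concrete, here's a toy sketch. It is not how pdfminer.six or MuPDF actually work; the `Block` type, the coordinates and the fixed two-column split are all assumptions for illustration. Sorting text blocks purely top-to-bottom interleaves the columns, while grouping by column first preserves reading order.

```python
from dataclasses import dataclass


@dataclass
class Block:
    x: float  # left edge of the text block
    y: float  # top edge (larger means lower on the page)
    text: str


def naive_order(blocks):
    """Sort top-to-bottom, then left-to-right: interleaves columns."""
    return [b.text for b in sorted(blocks, key=lambda b: (b.y, b.x))]


def column_order(blocks, page_width, columns=2):
    """Assign each block to an equal-width column, then sort within columns."""
    col_width = page_width / columns
    return [
        b.text
        for b in sorted(blocks, key=lambda b: (int(b.x // col_width), b.y))
    ]


# Hypothetical two-column page, 600 units wide
blocks = [
    Block(50, 100, "L1"), Block(320, 100, "R1"),
    Block(50, 200, "L2"), Block(320, 200, "R2"),
]

print(naive_order(blocks))        # ['L1', 'R1', 'L2', 'R2']
print(column_order(blocks, 600))  # ['L1', 'L2', 'R1', 'R2']
```

Real layouts need column detection rather than a fixed split, which is where the libraries differ.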

Tika uses PDFBox under the hood, using its built-in text extractor (which is "ok"). If you're looking for table extraction specifically, check out Tabula (https://tabula.technology) which is also built on top of PDFBox and has some contributions from the same maintainers. PDFBox actually exposes a lower-level API for text extraction (I wrote it!) than the one Tabula uses, allowing you to roll your own extractor - but that's where dragons live, trust me :)
I don't have scientific metrics, but I've found the quality much better than most. It does a pretty good job of pulling data from text and tables.