Nice project! I've long used Tika for document parsing given its maturity and the wide range of formats it supports. The XHTML output helps with chunking documents for RAG.
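To make the chunking point concrete, here's a rough sketch of what I mean. It assumes the XHTML is well-formed, uses the standard XHTML namespace, and wraps paragraphs in `<p>` elements; `chunk_xhtml`, `max_chars` and the sample markup are made up for illustration and are not part of Tika.

```python
import xml.etree.ElementTree as ET

# Assumption: the XHTML uses the standard XHTML namespace.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def chunk_xhtml(xhtml: str, max_chars: int = 500) -> list[str]:
    """Pack consecutive <p> texts into chunks, aiming for max_chars each.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    root = ET.fromstring(xhtml)
    chunks: list[str] = []
    current = ""
    for p in root.iter(XHTML_NS + "p"):
        # Flatten nested inline markup and normalise whitespace.
        text = " ".join("".join(p.itertext()).split())
        if not text:
            continue
        if current and len(current) + 1 + len(text) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = text
        else:
            current = f"{current} {text}" if current else text
    if current:
        chunks.append(current)
    return chunks


# Hypothetical sample in the shape of paginated XHTML output.
sample = (
    '<html xmlns="http://www.w3.org/1999/xhtml"><body>'
    '<div class="page"><p>First paragraph.</p><p>Second one.</p></div>'
    '<div class="page"><p>Third paragraph on page two.</p></div>'
    "</body></html>"
)
print(chunk_xhtml(sample, max_chars=30))
```

Splitting on paragraph boundaries rather than at fixed character offsets keeps sentences intact, which is the main reason structured output is nicer to chunk than plain text.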
Off-topic, but do you know how Tika compares to other PDF parsing libraries? I was very unimpressed by pdfminer.six (what unstructured uses): its layout detection seems pretty basic and it fails to parse multi-column text, whereas MuPDF handles it perfectly.
Currently I'm using a mix of MuPDF + AWS Textract (for tables, mostly), but I'd love to understand what other people are doing.
Tika uses PDFBox under the hood, using its built-in text extractor (which is "ok"). If you're looking for table extraction specifically, check out Tabula (https://tabula.technology) which is also built on top of PDFBox and has some contributions from the same maintainers. PDFBox actually exposes a lower-level API for text extraction (I wrote it!) than the one Tabula uses, allowing you to roll your own extractor - but that's where dragons live, trust me :)