I couldn't try this tool because it doesn't build on Apple silicon (and there's no ARM Docker image).
However, I have a PDF parsing use case that I tried those RAG tools on, and the output they give me is pretty low quality. It kind of works for RAG because the LLM can work around the issues, but if you want higher-quality responses with proper references and such, I think the best way is to write your own rule-based parser, which is what I ended up doing (based on MuPDF though, not Tika).
Maybe that's what the authors of this tool were thinking too.
To run the Docker image on Apple silicon, you can pull it with the following command. It will be slower, since the x86_64 image runs under emulation, but it works:
docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest
Thanks, I always forget I can do that! I've given it a go and it's really impressive: the default chunker is very smart and manages to keep most of the chunk context together.
The table parser in particular is really good. Is the trick that you draw guide lines and rectangles around tables? I'm trying to understand the GraphicsStreamProcessor class, as I'm not familiar with Tika. How does it know where to draw in the first place?
Same here. fitz (PyMuPDF) is great: it does well enough out of the box that I can apply some simple heuristics, such as joining or splitting paragraphs where it makes a mistake and extracting drawings, and get pretty close to 100% accuracy on the output.
The only thing it doesn't do is table detection (neither does pdfminer.six), but there are plenty of other ways to handle tables.
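For anyone curious what a "simple heuristic" for joining paragraphs might look like, here is a rough sketch, not anyone's actual parser. The helper names (should_join, merge_blocks, load_blocks) and the specific rules (join when the next block starts lowercase and the previous one lacks sentence-ending punctuation; de-hyphenate a trailing "-") are my own assumptions for illustration. The only PyMuPDF call used is page.get_text("blocks"), and the merging logic works on plain strings, so it can be tried without a PDF.

```python
def should_join(prev: str, nxt: str) -> bool:
    """Hypothetical rule: nxt continues prev if prev has no sentence-ending
    punctuation and nxt starts with a lowercase letter."""
    prev, nxt = prev.rstrip(), nxt.lstrip()
    if not prev or not nxt:
        return False
    # Ignore closing quotes/brackets when looking for end punctuation.
    ends_sentence = prev.rstrip("\"')]").endswith((".", "!", "?", ":"))
    return nxt[0].islower() and not ends_sentence


def merge_blocks(blocks):
    """Merge text blocks that were wrongly split mid-paragraph."""
    merged = []
    for text in blocks:
        text = " ".join(text.split())  # collapse newlines and repeated spaces
        if not text:
            continue
        if merged and should_join(merged[-1], text):
            if merged[-1].endswith("-"):
                # Assume a trailing hyphen is a line-break hyphenation.
                merged[-1] = merged[-1][:-1] + text
            else:
                merged[-1] += " " + text
        else:
            merged.append(text)
    return merged


def load_blocks(path):
    """Read text blocks from a PDF with PyMuPDF (requires `pip install pymupdf`)."""
    import fitz  # imported lazily so the heuristic above runs without it

    texts = []
    with fitz.open(path) as doc:
        for page in doc:
            # Each block is (x0, y0, x1, y1, text, block_no, block_type).
            for block in page.get_text("blocks"):
                if block[6] == 0:  # 0 = text block, 1 = image block
                    texts.append(block[4])
    return texts


sample = [
    "The parser splits para-",
    "graphs at column breaks and",
    "then resumes here.",
    "A new paragraph starts.",
]
paragraphs = merge_blocks(sample)
```

With a real file you would call merge_blocks(load_blocks("some.pdf")). Real documents need more rules than this (headings, list items, genuinely hyphenated words), so treat it as a starting point.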
Last time I tried LangChain (admittedly, that was ~6 months ago), the implementations for content extraction from PDFs and HTML files were very basic: enough to get a prototype RAG solution going, but not enough to build anything reliable. This looks like a much more battle-tested implementation.