	by lmeyerov 878 days ago
	Tesseract OCR fallback sounds great! There are now a lot of file loaders for RAG (LangChain, LlamaIndex, unstructured, ...). Is there any reason, like a leading benchmark score, to prefer this one?

mpeg 878 days ago

I couldn't try this tool as it doesn't build on Apple silicon (and there's no ARM Docker image).

However, I have a PDF parsing use case that I tried those RAG tools for, and the output they give me is pretty low quality. It kinda works for RAG, as the LLM can work around the issues, but if you want higher-quality responses with proper references and such, I think the best way is to write your own rule-based parser, which is what I ended up doing (based on MuPDF though, not Tika).

Maybe that's what the authors of this tool were thinking too.

asukla 878 days ago

To run the Docker image on Apple silicon, you can use the following command to pull. It will be slower, but it works:

    docker pull --platform linux/x86_64 ghcr.io/nlmatics/nlm-ingestor:latest

mpeg 878 days ago

Thanks, I always forget I can do that! I've given it a go and it's really impressive: the default chunker is very smart and manages to keep most of the chunk context together.

The table parser in particular is really good. Is the trick that you draw some guide lines and rectangles around tables? I'm trying to understand the GraphicsStreamProcessor class, as I'm not familiar with Tika. How does it know where to draw in the first place?

ramoz 878 days ago

For me, PyMuPDF/fitz has been the best way to retain natural reading order and set dynamic enough rules to extract text in complex layouts.

None of the mentioned tools did this out of the box, and none seemed easy to configure, though all are definitely hyped and marketed way beyond fitz.
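
A minimal sketch of the kind of reading-order rule described above, assuming a simple two-column page. The block tuples mimic the first five fields of what PyMuPDF's `page.get_text("blocks")` returns (`x0, y0, x1, y1, text`); the function name `reading_order` and the "split at the page midline" rule are illustrative assumptions, not anything taken from fitz itself or from the commenter's setup.

```python
from typing import List, Tuple

# (x0, y0, x1, y1, text): same leading fields as a PyMuPDF text block.
Block = Tuple[float, float, float, float, str]


def reading_order(blocks: List[Block], page_width: float) -> List[Block]:
    """Order blocks for a two-column page.

    Blocks that cross the vertical midline (titles, wide figures) are
    treated as full-width separators. Between two separators, the left
    column is read top to bottom, then the right column.
    """
    mid = page_width / 2
    ordered: List[Block] = []
    pending: List[Block] = []  # column blocks seen since the last separator

    def flush() -> None:
        # pending is already sorted by y0, so filtering keeps that order.
        ordered.extend(b for b in pending if b[0] < mid)   # left column
        ordered.extend(b for b in pending if b[0] >= mid)  # right column
        pending.clear()

    for block in sorted(blocks, key=lambda b: b[1]):
        x0, _, x1, _, _ = block
        if x0 < mid < x1:  # spans both columns
            flush()
            ordered.append(block)
        else:
            pending.append(block)
    flush()
    return ordered


if __name__ == "__main__":
    page = [
        (310, 60, 550, 100, "R1"),
        (50, 110, 290, 150, "L2"),
        (50, 20, 550, 40, "Title"),
        (310, 105, 550, 150, "R2"),
        (50, 60, 290, 100, "L1"),
    ]
    print([b[4] for b in reading_order(page, page_width=600)])
```

Real layouts usually need more rules than this (sidebars, footnotes, three columns), which is where per-document tuning comes in.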

mpeg 878 days ago

Same here, fitz is great. It does well enough out of the box that I can apply some simple heuristics for things like joining/splitting paragraphs where it makes a mistake, extract drawings and such, and get pretty close to 100% accuracy on the output.

The only thing it doesn't do is table detection (neither does pdfminer.six), but there are plenty of other ways to handle them.
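
As a rough illustration of the "simple heuristics for joining paragraphs" idea, here is a sketch that works on a plain list of block texts. The rule itself (merge when the previous block lacks closing punctuation and the next one starts lowercase) and the names `should_join` and `join_paragraphs` are assumptions made for this example, not the heuristics actually used in the comment above.

```python
from typing import List

# Characters that suggest the previous block ended a sentence.
SENTENCE_END = (".", "!", "?", ":", '"', "\u201d")


def should_join(prev: str, nxt: str) -> bool:
    """Guess whether nxt continues the paragraph that prev started."""
    return not prev.endswith(SENTENCE_END) and nxt[0].islower()


def join_paragraphs(texts: List[str]) -> List[str]:
    """Merge text blocks that look like one paragraph split in two.

    A trailing hyphen is treated as a word broken across blocks and is
    dropped on merge. This is a simplification: it will also glue
    genuine hyphenated compounds together.
    """
    merged: List[str] = []
    for raw in texts:
        text = " ".join(raw.split())  # collapse newlines and extra spaces
        if not text:
            continue
        if merged and should_join(merged[-1], text):
            prev = merged[-1]
            if prev.endswith("-"):
                merged[-1] = prev[:-1] + text
            else:
                merged[-1] = prev + " " + text
        else:
            merged.append(text)
    return merged


if __name__ == "__main__":
    blocks = [
        "The parser reads each\npage and splits it",
        "into blocks of text.",
        "A new paragraph starts here, with hyphen-",
        "ation across blocks.",
        "Heading",
    ]
    for paragraph in join_paragraphs(blocks):
        print(paragraph)
```

With PyMuPDF, the input list could come from something like `[b[4] for b in page.get_text("blocks")]`, since the fifth field of each block tuple is its text.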

rmsaksida 878 days ago

Last time I tried Langchain (admittedly, that was ~6 months ago) the implementations for content extraction from PDFs and HTML files were very basic. Enough to get a prototype RAG solution going, but not enough to build anything reliable. This looks like a much more battle-tested implementation.
