by guidedlight 879 days ago
How does this differ from Azure Document Intelligence, or are they effectively the same thing?
4 comments

No, we are not doing the same thing. Most cloud parsers use a vision model, and they are a lot slower and more expensive; you also need to write code on top of them to extract good chunks.

You can use the llmsherpa library - https://github.com/nlmatics/llmsherpa - with this server to get nice layout-friendly chunks for your LLM/RAG project.

There’s no OCR or AI involved here (other than the standard fallback).

What this library, and something like fitz/pymupdf, allows you to do is extract the text straight from the PDF, using rules about how to parse and structure it. (For most modern PDFs you can extract the text without OCR.)

This is obviously much cheaper, but it doesn’t scale well across dynamic layouts, so you are likely using it when you can configure around a standard structure. That said, I have found rule-based text extraction to work fairly dynamically for things like scientific PDFs.
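
To give a feel for what "rules" means here, a minimal sketch, not this library's actual logic. It assumes the text spans and their font sizes have already been pulled out of the PDF (for example with pymupdf); the `spans` data, the `classify_spans` name and the 1.2 ratio are all made up for illustration.

```python
from statistics import median


def classify_spans(spans, heading_ratio=1.2):
    """Label each (text, font_size) span as 'heading' or 'body'.

    Illustrative rule (an assumption, not any real parser's rule):
    a span whose font size is at least `heading_ratio` times the
    median font size on the page is treated as a heading.
    """
    body_size = median(size for _, size in spans)
    labeled = []
    for text, size in spans:
        kind = "heading" if size >= heading_ratio * body_size else "body"
        labeled.append((kind, text))
    return labeled


# Hypothetical spans, as if already extracted from one PDF page.
spans = [
    ("1 Introduction", 14.0),
    ("Rule-based parsers read text directly.", 10.0),
    ("No OCR is needed for most modern PDFs.", 10.0),
    ("2 Method", 14.0),
    ("Font size is one cheap layout signal.", 10.0),
]

for kind, text in classify_spans(spans):
    print(f"{kind:8} {text}")
```

A rule like this is cheap and works as long as the layout is consistent, which is also why it gets brittle when every document uses different fonts and sizes.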

Last I used it, Azure Document Intelligence wasn't all that smart about choosing split points. This seems to implement better heuristics.
I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.

What is a split point? I use Textract a lot and, from my testing, it always beats out any of the open source tooling at extracting information. That could also be highly dependent on the document format, though.
I think it refers to the place where a larger document is split into chunks for calculating embeddings and storage.
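
A rough sketch of that idea, under the assumption that the document has already been parsed into (level, text) blocks, where level 0 is a paragraph and level 1 or higher is a heading of that depth. The `chunk_with_hierarchy` name and the block format are hypothetical and not taken from llmsherpa or Azure; the point is only that headings act as split points and each chunk carries its section path.

```python
def chunk_with_hierarchy(blocks):
    """Split at headings and prefix every chunk with its section path.

    `blocks` is a list of (level, text) pairs (a made-up format):
    level >= 1 marks a heading of that depth, level 0 a body paragraph.
    """
    path = []    # current heading stack, outermost first
    body = []    # paragraphs collected since the last split point
    chunks = []

    def flush():
        # Emit the collected paragraphs as one chunk under the current path.
        if body:
            chunks.append(" > ".join(path) + "\n" + " ".join(body))
            body.clear()

    for level, text in blocks:
        if level > 0:
            flush()  # a heading is a split point
            # Keep the ancestors of this heading, then push it.
            path = path[: level - 1] + [text]
        else:
            body.append(text)
    flush()
    return chunks


blocks = [
    (1, "Methods"),
    (2, "Data"),
    (0, "We collected 100 PDFs."),
    (0, "Each was parsed twice."),
    (2, "Evaluation"),
    (0, "Chunks were scored by hand."),
]

for chunk in chunk_with_hierarchy(blocks):
    print(chunk)
    print("---")
```

Each chunk here starts with a line like "Methods > Data", so the embedding sees which section the text came from instead of an orphaned paragraph.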
I am also curious about this. ADI is reliable but does have edge-case issues with malformed PDFs.

I fear Tesseract OCR is a potential limitation though. I’ve seen it make so many mistakes.