| HN Mirror



	by guidedlight 879 days ago
	How does this differ from Azure Document Intelligence, or are they effectively the same thing?

4 comments

asukla 878 days ago

No, we are not doing the same thing. Most cloud parsers use a vision model, so they are a lot slower and more expensive, and you need to write code on top of them to extract good chunks.

You can use the llmsherpa library (https://github.com/nlmatics/llmsherpa) with this server to get nice, layout-friendly chunks for your LLM/RAG project.


ramoz 878 days ago

There’s no OCR or AI involved here (other than the standard fallback).

What this library, and something like fitz/PyMuPDF, allows you to do is extract the text straight from the PDF, using rules about how to parse and structure it. (With most modern PDFs you can extract text without OCR.)

This is obviously much cheaper, but it doesn’t scale well across dynamic layouts, so you are likely using it when you can configure around a standard structure. That said, I have found rule-based text extraction to work fairly dynamically for things like scientific PDFs.
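
As a rough illustration of what "rules" can mean here: once an extractor has returned text spans with font sizes, a simple heuristic can label headings without any OCR or model. The sketch below is stdlib-only. The `Span` type, the `classify_spans` helper, and the 1.2x size threshold are assumptions made up for this example, not part of this library or PyMuPDF.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Span:
    """A piece of extracted text with its font size (hypothetical type)."""
    text: str
    font_size: float


def classify_spans(spans, ratio=1.2):
    """Label each span 'heading' or 'body' using a font-size rule.

    A span counts as a heading when its font size is at least `ratio`
    times the median size on the page (an assumed threshold).
    """
    if not spans:
        return []
    body_size = median(s.font_size for s in spans)
    return [
        ("heading" if s.font_size >= ratio * body_size else "body", s.text)
        for s in spans
    ]


# Made-up page content for illustration only.
page = [
    Span("1 Introduction", 16.0),
    Span("PDF text can often be read directly.", 10.0),
    Span("No OCR is needed for digitally produced files.", 10.0),
    Span("2 Method", 16.0),
    Span("We apply layout rules to the extracted spans.", 10.0),
]
labels = classify_spans(page)
for kind, text in labels:
    print(f"{kind:8} {text}")
```

A rule like this works when the layout is consistent, and breaks as soon as a document uses same-size bold headings or varied body sizes, which is the scaling limit described above.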


StrauXX 879 days ago

Last I used it, Azure Document Intelligence wasn't all that smart about choosing split points. This seems to implement better heuristics.


asukla 878 days ago

I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.
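
To make the section-hierarchy idea concrete, here is a minimal sketch that prefixes each paragraph chunk with the path of headings above it. It is not llmsherpa's implementation or API; the `chunks_with_hierarchy` function and the `(kind, level, text)` input format are assumptions for illustration.

```python
def chunks_with_hierarchy(blocks):
    """Return one chunk per paragraph, prefixed with its section path.

    `blocks` is a list of (kind, level, text) tuples, where kind is
    "heading" or "para" and level is the heading depth (1 = top level).
    The level is ignored for paragraphs.
    """
    path = []  # current stack of headings, outermost first
    chunks = []
    for kind, level, text in blocks:
        if kind == "heading":
            # Drop headings at the same or a deeper level, then push this one.
            del path[level - 1:]
            path.append(text)
        else:
            prefix = " > ".join(path)
            chunks.append(f"{prefix}\n{text}" if prefix else text)
    return chunks


# Made-up document structure for illustration only.
blocks = [
    ("heading", 1, "Results"),
    ("heading", 2, "Accuracy"),
    ("para", 0, "Scores are reported per dataset."),
    ("heading", 2, "Latency"),
    ("para", 0, "Timings are reported per page."),
]
chunks = chunks_with_hierarchy(blocks)
for chunk in chunks:
    print(chunk)
    print("---")
```

Carrying the heading path inside each chunk means a retrieved paragraph still says which section it came from, even when it is embedded and stored on its own.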


infecto 878 days ago

What is a split point? I use Textract a lot and, in my testing, it always beats any of the open-source tooling at extracting information. That could also be highly dependent on the document format.


batch12 878 days ago

I think it refers to the place where a larger document is split into chunks for calculating embeddings and for storage.
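
A small sketch of the difference a split point makes: instead of cutting every N characters, the function below only breaks at blank-line paragraph boundaries while staying under a size budget. The `split_at_paragraphs` name and the `max_chars` parameter are assumptions for this example, not any particular tool's behavior.

```python
def split_at_paragraphs(text, max_chars=200):
    """Split text into chunks, breaking only at blank-line paragraph boundaries.

    Paragraphs are packed greedily. A paragraph longer than max_chars
    becomes its own oversized chunk rather than being cut mid-sentence.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would exceed the budget: split here.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


doc = "First paragraph.\n\nSecond paragraph.\n\nThird paragraph, a bit longer."
pieces = split_at_paragraphs(doc, max_chars=40)
for piece in pieces:
    print(repr(piece))
```

Each resulting piece would then be embedded and stored separately, so a split in the wrong place can separate a sentence from the context it needs.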


cdolan 878 days ago

I am also curious about this. ADI is reliable but does have edge-case issues on malformed PDFs.

I fear Tesseract OCR is a potential limitation, though. I’ve seen it make so many mistakes.
