
	by StrauXX 878 days ago
	Last I used it, Azure Document Intelligence wasn't all that smart about choosing split points. This seems to implement better heuristics.

2 comments

asukla 878 days ago

I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

All of this is automated in the llmsherpa parser (https://github.com/nlmatics/llmsherpa), which you can use as an API over this library.
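
To make the "section hierarchy" idea concrete, here is a minimal sketch. It is not llmsherpa's API or its algorithm. It assumes the document has already been reduced to markdown-style lines where headings start with `#`, and the helper name `chunk_with_hierarchy` is hypothetical. Each chunk carries the path of headings above it, so the text being embedded keeps its context.

```python
def chunk_with_hierarchy(lines):
    """Return (section_path, text) pairs from markdown-style lines.

    Assumption: headings are lines starting with one or more '#'.
    This is an illustration only, not how llmsherpa works internally.
    """
    path = []    # current heading stack, e.g. ["Guide", "Install"]
    body = []    # text lines collected under the current heading
    chunks = []

    def flush():
        # Emit the text gathered so far, tagged with its heading path.
        text = " ".join(body).strip()
        if text:
            chunks.append((" > ".join(path), text))
        body.clear()

    for line in lines:
        stripped = line.strip()
        if stripped.startswith("#"):
            flush()
            level = len(stripped) - len(stripped.lstrip("#"))
            title = stripped[level:].strip()
            # Drop headings at the same or a deeper level, then push this one.
            del path[level - 1:]
            path.append(title)
        elif stripped:
            body.append(stripped)
    flush()
    return chunks


doc = [
    "# Guide",
    "Intro text.",
    "## Install",
    "Run the installer.",
    "## Usage",
    "Call the API.",
]
for section, text in chunk_with_hierarchy(doc):
    # Prefixing the path gives the embedding model the surrounding context.
    print(f"[{section}] {text}")
```

Here every heading is a split point, and the path prefix is what lets a retrieved chunk say which section it came from.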


infecto 878 days ago

What is a split point? I use Textract a lot and, in my testing, it always beats any of the open source tooling at extracting information. That could also be highly dependent on the document format.


batch12 878 days ago

I think it refers to the place where a larger document is split into chunks for calculating embeddings and for storage.
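
As a rough illustration of what choosing a split point can look like, here is a minimal sketch. It does not describe how any tool mentioned in this thread does it. The function names and the `max_chars` limit are made up for the example, and `max_chars` is assumed to be at least 1. It prefers to cut at a paragraph break, then at a sentence end, and only falls back to a hard cut when neither fits in the window.

```python
def split_points(text, max_chars):
    """Return character offsets at which to cut `text` into chunks."""
    points = []
    start = 0
    while len(text) - start > max_chars:
        window_end = start + max_chars
        # Prefer the last paragraph break that fits inside the window.
        cut = text.rfind("\n\n", start + 1, window_end)
        if cut != -1:
            cut += 2
        else:
            # Otherwise use the last sentence end, or a hard cut as a last resort.
            cut = text.rfind(". ", start + 1, window_end)
            cut = cut + 2 if cut != -1 else window_end
        points.append(cut)
        start = cut
    return points


def split_at(text, points):
    """Cut `text` at the given offsets and return the chunks in order."""
    bounds = [0, *points, len(text)]
    return [text[a:b] for a, b in zip(bounds, bounds[1:])]


text = (
    "First paragraph here.\n\n"
    "Second paragraph here. It has two sentences.\n\n"
    "Third."
)
for chunk in split_at(text, split_points(text, max_chars=40)):
    print(repr(chunk))
```

A fixed-size splitter would cut wherever the character count runs out, often mid-sentence. The point of a better heuristic is that each chunk stays a coherent unit before it is embedded.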
