Hacker News new | ask | show | jobs
by StrauXX 878 days ago
Last I used it, Azure Document Intelligence wasn't all that smart about choosing split points. This seems to implement better heuristics.
2 comments

I wrote about split points and the need for including section hierarchy in this post: https://ambikasukla.substack.com/p/efficient-rag-with-docume...

All this is automated in the llmsherpa parser https://github.com/nlmatics/llmsherpa which you can use as an API over this library.

What is a split point? I use Textract a lot and from my testing, always beats out any of the open source tooling to extract information. That could also be highly dependent on the document format.
I think it is a reference to the place a larger document is split into chunks for calculating embeddings and storage.