
Yep, benchmarks are available at https://github.com/ZeroEntropy-AI/llama-chunk?tab=readme-ov-... . We used the https://github.com/ZeroEntropy-AI/legalbenchrag dataset, which is a retrieval-focused version of LegalBench.

It scored better than LlamaIndex's recursive character text splitter, and that baseline included some custom regex work to improve it. With enough effort on the regex you could probably match it, but the whole point of agentic chunking is that it's automatic and contextual.
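
For anyone unfamiliar with the baseline, here is a rough sketch of what a recursive character splitter with regex separators does. This is not LlamaIndex's code or the benchmark setup: the function name `recursive_split`, the `max_chars` parameter, and the separator patterns are all made up for illustration. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too long.

```python
import re

# Hypothetical separators, coarsest first: blank lines, newlines,
# sentence-ish punctuation, then any whitespace.
DEFAULT_SEPARATORS = (r"\n\s*\n", r"\n", r"(?<=[.;:])\s+", r"\s+")

def recursive_split(text, max_chars, separators=DEFAULT_SEPARATORS):
    """Split text into chunks of at most max_chars characters."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    head, rest = separators[0], separators[1:]
    pieces = [p for p in re.split(head, text) if p.strip()]

    chunks, current = [], ""
    for piece in pieces:
        # Greedily merge neighbouring pieces (joined by one space, so the
        # original separator is not preserved) while they still fit.
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # Piece is still too long: retry with the finer separators.
            chunks.extend(recursive_split(piece, max_chars, rest))
    if current:
        chunks.append(current)
    return chunks
```

The "custom regex work" is basically tuning that separator list per document type (clause numbers, section headings, and so on), which is exactly the manual step the agentic approach is trying to avoid.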