| HN Mirror



	by srcreigh 165 days ago
	4/5 of today's top CNN articles have words with periods in them: "Mr.", "Dr.", "No.", "John D. Smith", "Rep." The last one also has periods within quotations, so period chunking would cut off the quote.
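To make the failure mode concrete, here is a minimal sketch of naive period chunking. The function name and the sample sentence are made up for illustration; this is not taken from any library discussed in the thread.

```python
def split_on_periods(text: str) -> list[str]:
    """Naive chunker: cut after every period, keeping the period with its chunk."""
    chunks = []
    start = 0
    for i, ch in enumerate(text):
        if ch == ".":
            piece = text[start:i + 1].strip()
            if piece:
                chunks.append(piece)
            start = i + 1
    # Keep any trailing text that does not end in a period.
    tail = text[start:].strip()
    if tail:
        chunks.append(tail)
    return chunks


# Hypothetical sample: two real sentences, with abbreviations and a quotation.
sample = 'Dr. Smith met Rep. Jones. "No. Not today," she said.'
chunks = split_on_periods(sample)
for chunk in chunks:
    print(repr(chunk))
```

The two sentences come out as five fragments: the abbreviations "Dr." and "Rep." each end a chunk, and the quotation is cut after "No."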

4 comments

SteveJS 165 days ago

This gets those cases right.

https://github.com/KnowSeams/KnowSeams

(On a beefy machine) It gets 1 TB/s throughput, including all IO and position mapping back to the original text location. I used it to split Project Gutenberg novels. It does 20k+ novels in about 7 seconds.

Note it keeps all dialog together, which may not be what others want, but was what I wanted.


snyy 165 days ago

A big chunk size with overlap solves this. Chunks don't have to be "perfectly" split in order to work well.
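A minimal sketch of what fixed-size chunking with overlap could look like, assuming character-based sizes. The function name and parameters are illustrative, not from any specific library.

```python
def chunk_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Split text into windows of `size` characters; consecutive windows share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


example_chunks = chunk_with_overlap("abcdefghij", size=4, overlap=2)
print(example_chunks)  # ['abcd', 'cdef', 'efgh', 'ghij']
```

With overlap, text cut at one boundary also appears uncut in the neighbouring chunk, as long as it is shorter than the overlap. That is why the exact split position matters less.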


srcreigh 165 days ago

True, but you don’t need 150GB/s delimiter scanning in that case either.


snyy 165 days ago

As the other comment said, it's a practice in good-enough chunk quality. We focus on big chunks (the largest we can make without hurting embedding quality) as fast as possible. In our experience, retrieval accuracy is mostly driven by embedding quality, so perfect splits don't move the needle much.

But as the number of files to ingest grows, chunking speed does become a bottleneck. We want faster everything (chunking, embedding, retrieval) but chunking was the first piece we tackled. Memchunk is the fastest we could build.
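For readers who want a picture of "big chunks, split at delimiters", here is a rough sketch: take a window of up to `target` bytes and cut at the last delimiter inside it, falling back to a hard cut if none is found. This is an illustration only and not Memchunk's actual implementation or API; the function name and default delimiter set are hypothetical.

```python
def chunk_at_delimiters(data: bytes, target: int, delimiters: bytes = b"\n.?") -> list[bytes]:
    """Greedy chunking: each chunk is at most `target` bytes and ends at the last
    delimiter found in its window, or at the window edge if there is none."""
    if target <= 0:
        raise ValueError("target must be positive")
    chunks = []
    start = 0
    n = len(data)
    while start < n:
        end = min(start + target, n)
        if end < n:
            # Search backward from the window edge for the last delimiter byte.
            cut = max(data.rfind(bytes([d]), start, end) for d in delimiters)
            if cut != -1:
                end = cut + 1  # keep the delimiter with the chunk it ends
        chunks.append(data[start:end])
        start = end
    return chunks


data = b"One. Two. Three four five."
pieces = chunk_at_delimiters(data, target=12)
print(pieces)
```

Each chunk is cut just after a delimiter where one exists, and concatenating the pieces gives back the input unchanged, so positions still map to the original text.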


Havoc 165 days ago

I suspect chunking is an exercise in „good enough“


ubertaco 165 days ago

Does this even work if you're incredulous enough???
