

by chiccomagnus 742 days ago

Well-written article, but it misses some key considerations:

- Titles matter, a lot: if you add the section title to the start of each chunk, you will get 10x better embeddings and therefore more accurate results.

- Size doesn't matter: how you should chunk depends on the combination of the layout and the semantics of the content.

- Avoid garbage in / garbage out: larger context windows don't mean you can put trash inside them. The better you are at putting in relevant information, the more precise the answers you get. This is especially important for enterprise-grade solutions.
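
To make the first point concrete, here is a minimal sketch of title-prefixed chunking. It assumes the document has already been split into (title, body) pairs; `chunk_with_titles` and `max_chars` are hypothetical names for illustration, not taken from the article or any library.

```python
from typing import List, Tuple


def chunk_with_titles(sections: List[Tuple[str, str]], max_chars: int = 500) -> List[str]:
    """Pack each section's paragraphs into chunks and prefix every chunk with its title."""
    chunks = []
    for title, body in sections:
        # Paragraphs are assumed to be separated by blank lines.
        paragraphs = [p.strip() for p in body.split("\n\n") if p.strip()]
        current = ""
        for para in paragraphs:
            candidate = f"{current}\n\n{para}" if current else para
            if current and len(candidate) > max_chars:
                # The chunk is full: emit it with the section title on top.
                chunks.append(f"{title}\n\n{current}")
                current = para
            else:
                current = candidate
        if current:
            chunks.append(f"{title}\n\n{current}")
    return chunks


sections = [
    ("Refund policy", "Refunds take 5 days.\n\nContact support first."),
    ("Shipping", "We ship worldwide."),
]
chunks = chunk_with_titles(sections, max_chars=30)
```

Each string in `chunks` is what you would pass to the embedding model, so a chunk like "Contact support first." still carries the "Refund policy" context. A single paragraph longer than `max_chars` is kept whole in this sketch rather than cut mid-sentence.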

There are good emerging API solutions that implement semantic + layout-based chunking, which in my opinion is the best chunking strategy for PDF / Office files (the most common use case for enterprises).
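
As a rough illustration of the layout half only, here is a sketch that splits text at headings instead of at a fixed size. It assumes the PDF / Office file has already been converted to text with markdown-style `#` headings, which is a stand-in for the layout signals a real parser would give you. `split_by_headings` is a hypothetical helper, and the semantic part is not covered here.

```python
import re
from typing import List, Tuple

# Assumption: headings look like "# Title" through "###### Title".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_headings(text: str) -> List[Tuple[str, str]]:
    """Return (title, body) pairs, starting a new section at every heading line."""
    sections = []
    title, lines = "", []
    for line in text.splitlines():
        match = HEADING.match(line)
        if match:
            # Close the previous section if it has a title or any content.
            if title or any(l.strip() for l in lines):
                sections.append((title, "\n".join(lines).strip()))
            title, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    if title or any(l.strip() for l in lines):
        sections.append((title, "\n".join(lines).strip()))
    return sections


doc = "# Intro\nHello.\n\n## Setup\nInstall it.\nRun it."
result = split_by_headings(doc)
```

Text that appears before the first heading ends up in a section with an empty title, so nothing is dropped.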