

by PeterStuer 922 days ago

To get consistent semantic chunks for RAG, you can't just slice a document into arbitrary 2k-character chunks after doing a PDF text extraction.

Most documents have implicit structural semantics that are not explicitly worded in the text. You need to surface those and embed them into the chunks, and also further enrich chunk candidates by flattening in references and other relations.
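As a rough illustration of what surfacing structure and flattening references can look like, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: it presumes the extracted text has numbered headings like "2.1 Filter replacement" and cross-references written as "(see section 2.1)", and the names `Chunk`, `chunk_by_structure` and `flatten_references` are hypothetical. Real documents will need their own patterns.

```python
import re
from dataclasses import dataclass

# Assumed conventions (hypothetical, adjust per document type):
# headings look like "2.1 Title", references look like "(see section 2.1)".
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+(.+)$")
REFERENCE = re.compile(r"\(see section (\d+(?:\.\d+)*)\)", re.IGNORECASE)


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Safety", "Lockout"]
    text: str

    def to_embedding_text(self) -> str:
        # Prepend the heading path so the implicit structure ends up in the embedding.
        return " > ".join(self.heading_path) + "\n" + self.text


def chunk_by_structure(lines):
    """Split on headings instead of a fixed character count.

    Returns the chunks plus a map from section number to section title.
    """
    chunks, titles = [], {}
    path = []  # (number, title) pairs for the current heading hierarchy
    body = []  # body lines collected under the current heading

    def flush():
        if body:
            chunks.append(Chunk([title for _, title in path], " ".join(body)))
            body.clear()

    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = HEADING.match(line)
        if match:
            flush()
            number, title = match.groups()
            depth = number.count(".") + 1
            del path[depth - 1:]  # drop headings at the same or a deeper level
            path.append((number, title))
            titles[number] = title
        else:
            body.append(line)
    flush()
    return chunks, titles


def flatten_references(chunks, titles):
    """Inline the title of each referenced section into the chunk text."""

    def inline(match):
        number = match.group(1)
        title = titles.get(number)
        if title is None:
            return match.group(0)  # unknown target: leave the reference as-is
        return f'(see section {number}, "{title}")'

    return [Chunk(c.heading_path, REFERENCE.sub(inline, c.text)) for c in chunks]
```

You would call `chunk_by_structure(text.splitlines())`, pass the result through `flatten_references`, and embed `to_embedding_text()` for each chunk. Note the heading regex will misfire on any body line that starts with a number, which is exactly the kind of per-document tuning this requires.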

There is no general solution to this. While you can apply chunking and enrichment patterns, each document type is a bespoke effort to get it right.