Hacker News
by davidajackson 924 days ago
Chunking is one of those things that needs to be customized to the document being processed. As a general rule, try recursive chunking first, since it works for most questions. Also consider using multiple strategies in tandem. Combining them has the nice advantage of incorporating both broad and specific context. However, even the document itself is not enough to design a chunking strategy. Consider an HTML document and the questions:

1. How does that webpage code work?

2. Summarize this website.

As you can see, you might benefit from pre-processing the info differently based on the intended result. The first question cares strongly about the tags, styling, etc., while the second only cares about the text, so you could maybe just scrub the tags.
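To make the two cases concrete, here is a rough sketch using only Python's standard library. The `preprocess` function and its `intent` labels ("code" vs. "summary") are made up for illustration; a real pipeline would probably use a proper HTML library and a smarter way of deciding intent.

```python
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collects visible text and skips the bodies of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())


def preprocess(html, intent):
    """Hypothetical helper: 'code' keeps the markup, anything else scrubs it."""
    if intent == "code":
        return html  # question 1: tags and styling are the content
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)  # question 2: only the text matters


page = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><h1>Pricing</h1><p>Plans start at $5.</p></body></html>"
)
print(preprocess(page, "summary"))
```

Same document, two different inputs to the chunker, depending on what you plan to ask.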

Also, consider chunk overlap and max chunk size, and tune them based on different trials.
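For reference, a minimal sketch of recursive chunking with those two knobs exposed. This is not any particular library's implementation: the separator list, the defaults, and the function names are assumptions, and the separators are simply dropped when splitting to keep it short.

```python
def recursive_split(text, max_chars, separators=("\n\n", "\n", ". ", " ")):
    """Split text into pieces of at most max_chars, trying coarse separators first.

    Simplification: the separator itself is dropped from the output.
    """
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = []
            for part in text.split(sep):
                # Recurse with only the finer separators that remain.
                pieces.extend(recursive_split(part, max_chars, separators[i + 1:]))
            return pieces
    # No separator left to try: fall back to a hard cut.
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]


def chunk(text, max_chars=200, overlap=40):
    """Greedily pack pieces into chunks of at most max_chars characters.

    Each new chunk starts with the last `overlap` characters of the previous one.
    """
    if overlap < 0 or overlap + 1 >= max_chars:
        raise ValueError("overlap must be >= 0 and well below max_chars")
    chunks, current = [], ""
    # Leave room for the overlap tail plus one joining space.
    for piece in recursive_split(text, max_chars - overlap - 1):
        candidate = f"{current} {piece}".strip()
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = f"{current[-overlap:]} {piece}".strip() if overlap else piece
    if current:
        chunks.append(current)
    return chunks


text = " ".join(f"Sentence number {i} talks about topic {i % 3}." for i in range(30))
chunks = chunk(text, max_chars=120, overlap=20)
print(len(chunks), [len(c) for c in chunks])
```

Run it a few times with different `max_chars` and `overlap` values and read the chunks, not just the counts.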

Check your chunk scores (cosine similarity) against sample queries and make sure the chunk texts are meaningful. "Is this how I would store info in my head?" might be a good question to start with. If your chunks are garbage, you will get garbage.
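A quick way to eyeball this is to rank chunks against a sample query and look at the top scores. In the sketch below, `embed` is a toy word-count stand-in so the example runs without a model; it is an assumption, and you would swap in whatever embedding model your pipeline actually uses. The sample chunks are made up.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for a real embedding model: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_chunks(query, chunks):
    """Return (score, chunk) pairs, best match first."""
    q = embed(query)
    return sorted(((cosine(q, embed(c)), c) for c in chunks), reverse=True)


sample_chunks = [
    "Refunds are issued within 14 days of purchase.",
    "Our office is closed on public holidays.",
    "<div class='nav'> | | |",  # the kind of garbage chunk you want to catch
]
for score, text in rank_chunks("How many days do refunds take?", sample_chunks):
    print(f"{score:.2f}  {text}")
```

If a chunk like the last one ever scores near the top for a real query, that is a pre-processing problem, not a model problem.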

Consider visualizing your chunks in clusters to validate topic relevance.

Last thing: RAG is a multi-step architecture, and a single bad step will turn the whole thing to garbage, so put lots of debugging and eval steps in yours. Make sure it's not the prompt step that's ruining it. Identify the weak points and triage accordingly.
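One cheap way to separate the retrieval step from the prompt step is to score retrieval on its own with a handful of labeled queries. This is only a sketch: `retrieval_hit_rate`, the chunk ids, and the fake retriever are all hypothetical placeholders for your own pipeline.

```python
def retrieval_hit_rate(cases, retrieve, k=3):
    """Fraction of cases whose expected chunk id shows up in the top-k results.

    cases: list of (query, expected_chunk_id) pairs.
    retrieve: callable that takes a query and returns ranked chunk ids.
    """
    if not cases:
        raise ValueError("need at least one eval case")
    misses = []
    for query, expected_id in cases:
        top_ids = retrieve(query)[:k]
        if expected_id not in top_ids:
            misses.append((query, expected_id, top_ids))
    return 1 - len(misses) / len(cases), misses


# Stand-in retriever so the sketch runs; swap in your real retrieval step.
FAKE_INDEX = {
    "refund window": ["c7", "c2", "c9"],
    "office hours": ["c4", "c1", "c3"],
}


def fake_retrieve(query):
    return FAKE_INDEX.get(query, [])


cases = [("refund window", "c2"), ("office hours", "c8")]
rate, misses = retrieval_hit_rate(cases, fake_retrieve)
print(rate, misses)
```

If the hit rate is low, the right chunk never reached the prompt, so fix chunking or retrieval first. If it is high and the answers are still bad, then look at the prompt step.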