by ilaksh
1232 days ago
How do you decide where to break up the chunks for embedding? On mine I am currently just doing something like X words per chunk. It seems like ideally I could parse out all the source code and avoid breaking up functions, but I'm not sure how to do that for arbitrary languages.
|
We always try to split at logical breakpoints (e.g. never in the middle of a sentence or explanation). Some docs are cut into smaller chunks because the way they're written lends itself well to segmentation, and smaller chunks have the advantage that more of them can be retrieved per query, so the semantic search can get some results wrong as long as it finds 1-2 relevant context elements. For other docs, we felt that cutting them into chunks lost too much information, so we've added them as quite large chunks. That feels suboptimal in some ways, especially in terms of performance and modularity, but we've also found that the model is very good at parsing a sample of roughly 2k tokens and getting the right info from it in most cases.
Ultimately there's no right answer and it's a case-by-case tradeoff.
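
For anyone who wants a starting point, here is a rough sketch of the "logical breakpoints" idea, not our actual pipeline. It assumes plain-text docs with blank lines between paragraphs, uses a naive regex to find sentence ends, and counts words as a stand-in for tokens; `chunk_text`, `split_sentences` and `max_words` are made-up names for illustration.

```python
import re


def split_sentences(paragraph):
    # Naive splitter (assumption): a sentence ends at ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", paragraph.strip()) if s]


def chunk_text(text, max_words=200):
    """Greedily pack whole paragraphs into chunks of at most max_words words.

    Paragraphs that are too long on their own are broken into sentences.
    A single sentence longer than max_words is kept whole as an oversized
    chunk, so a chunk never ends mid-sentence.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]

    # Build the list of indivisible units: paragraphs where possible,
    # sentences where a paragraph exceeds the budget.
    units = []
    for p in paragraphs:
        if len(p.split()) <= max_words:
            units.append(p)
        else:
            units.extend(split_sentences(p))

    chunks, current, count = [], [], 0
    for unit in units:
        n = len(unit.split())
        # Close the current chunk if adding this unit would exceed the budget.
        if current and count + n > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(unit)
        count += n
    if current:
        chunks.append("\n".join(current))
    return chunks
```

Word counts only approximate token counts, so you'd want to swap in your embedding model's tokenizer and tune `max_words` per doc set. This also does nothing special for source code; keeping functions intact would need a language-aware parser.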