| HN Mirror

I've experimented with a few approaches and to be honest, kind of gone with what "felt best" as we're quite artisanal with our testing approach at the moment.

We try to always go for logical breakpoints (e.g. never in the middle of a sentence or explanation). Some docs are cut into smaller chunks because the way they're written works quite well for segmentation, and smaller chunks have the advantage of allowing more to be looked-up, so your semantic search is allowed to mess up as long as it finds 1-2 relevant context elements. For some, we felt like cutting into chunk was losing too much information, so we've added them as quite huge chunks. It feels suboptimal in some ways, especially in terms of performance and modularity, but we've also found that the model is very good at parsing a ±2k token length sample and getting the right info from it in most cases.

Ultimately there's no right answer and it's a case-by-case tradeoff.