by screye
1104 days ago
If you're concatenating after chunking, then the overlapping windows add quite a lot of repetition. Also, if a chunk cuts off mid-JSON or mid-structured output, overlapping windows once again cause issues. Define a custom recursive text splitter in LangChain and do the chunking heuristically. It works a lot better. That being said, it is useful to maintain some global and local context, but I wouldn't use overlapping windows.

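To make the "recursive splitter without overlap" idea concrete, here is a minimal stdlib-only sketch. It is not LangChain's implementation: the function name `recursive_split`, the `SEPARATORS` order, and the character-based `max_len` limit are all assumptions chosen for illustration. It tries the coarsest separator first and only falls back to finer ones for pieces that are still too long.

```python
from typing import List

# Hypothetical separator order: coarsest structure first, finest last.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def recursive_split(text: str, max_len: int,
                    separators: List[str] = SEPARATORS) -> List[str]:
    """Split text into non-overlapping chunks of at most max_len characters."""
    if len(text) <= max_len:
        return [text] if text else []
    if not separators:
        # No separator left to try: fall back to a hard cut.
        return [text[i:i + max_len] for i in range(0, len(text), max_len)]

    sep, finer = separators[0], separators[1:]
    parts = text.split(sep)
    if len(parts) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, max_len, finer)

    chunks: List[str] = []
    current = ""
    for i, part in enumerate(parts):
        # Keep the separator attached so the chunks rejoin to the original text.
        piece = part + sep if i < len(parts) - 1 else part
        if len(current) + len(piece) <= max_len:
            current += piece
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # The piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, max_len, finer))
    if current:
        chunks.append(current)
    return chunks
```

Because nothing is repeated between chunks, joining them gives back the original text exactly. For JSON or other structured output you would probably swap in separators that match the structure (for example, splitting between top-level records), which is the "heuristic" part.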
When working with longer documents, the process gets a bit more intricate. In this case, your embedding database needs to hold more information per entry. Ideally, each entry should store identifiers like the document ID, the starting token number, and the ending token number of the chunk. This way, even if a document appears more than once among the top results for a query, you can merge the matching spans and piece together the full relevant excerpt accurately.
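As a rough sketch of that bookkeeping, assuming each retrieved entry carries a document ID and a half-open token range: the `ChunkHit` type, its field names, and `merge_hits` are hypothetical and not part of any particular vector database.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ChunkHit:
    """Metadata stored alongside one embedding (hypothetical schema)."""
    doc_id: str
    start_token: int  # inclusive
    end_token: int    # exclusive


def merge_hits(hits: List[ChunkHit]) -> Dict[str, List[Tuple[int, int]]]:
    """Group hits by document and merge touching or overlapping token spans."""
    spans: Dict[str, List[Tuple[int, int]]] = {}
    for hit in sorted(hits, key=lambda h: (h.doc_id, h.start_token)):
        doc_spans = spans.setdefault(hit.doc_id, [])
        if doc_spans and hit.start_token <= doc_spans[-1][1]:
            # Touches or overlaps the previous span: extend it.
            prev_start, prev_end = doc_spans[-1]
            doc_spans[-1] = (prev_start, max(prev_end, hit.end_token))
        else:
            doc_spans.append((hit.start_token, hit.end_token))
    return spans
```

Each merged `(start, end)` pair can then be used to slice the document's token list, so two adjacent hits from the same document come back as one continuous excerpt instead of two fragments.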