| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by screye 1104 days ago

If you're concatenating after chunking , then the overlapping windows add quite a lot of repetition. Also, if it cuts off mid-json / mid-structured output then overlapping windows once again cause issues.

Define a custom recursive text splitter in langchain, and do chunking heuristically. It works a lot better.

That being said, it is useful to maintain some global and local context. But, I wouldn't use overlapping windows.

2 comments

rahimnathwani 1104 days ago

In place of simply concatenating after chunking, a more effective approach might be to retrieve and return the corresponding segments from the original documents that are relevant to the context. For instance, if we're dealing with short pieces of text such as Hacker News comments, it's fairly straightforward. Any partial match can prompt the return of the entire comment as it is.

When working with more extensive documents, the process gets a bit more intricate. In this case, your embedding database might need to hold more information per entry. Ideally, for each document, the database should store identifiers like the document ID, the starting token number, and the ending token number. This way, even if a document appears more than once among the top results from a query, it's possible to piece together the full relevant excerpt accurately.

gwern 1104 days ago

I don't think the repetition is a problem. He's using a local model for human-assisted writing with pre-generated embeddings - he can use essentially an arbitrary number of embedding calls, as long as it's more useful for the human. So it's just a question of whether that improves the quality or not. (Not that the cost would be more than a rounding error to embed your typical personal wiki with something like the OA API, especially since they just dropped the prices of embeddings again.)