| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by rahimnathwani 1107 days ago
	In place of simply concatenating after chunking, a more effective approach might be to retrieve and return the corresponding segments from the original documents that are relevant to the context. For instance, if we're dealing with short pieces of text such as Hacker News comments, it's fairly straightforward. Any partial match can prompt the return of the entire comment as it is. When working with more extensive documents, the process gets a bit more intricate. In this case, your embedding database might need to hold more information per entry. Ideally, for each document, the database should store identifiers like the document ID, the starting token number, and the ending token number. This way, even if a document appears more than once among the top results from a query, it's possible to piece together the full relevant excerpt accurately.