by TeMPOraL
972 days ago
> if you compress 25 pages of text into 1024 floats, you will lose a ton of information

Sure, but then if you do it one page at a time, or one paragraph at a time, you lose a ton of meaning - after all, individual paragraphs aren't independent of each other. And meaning is kind of the whole point of the exercise. Or put another way, squashing a ton of text loses you some high-frequency information, while chunking cuts off the low-frequency parts. Ideally you'd want to retain both.
I use a multi-pronged approach to this, based on a special type of summarization. I chunk on sentence boundaries (using punctuation), adding sentences until a chunk is just over 512 characters, then I embed each chunk. After embedding, I ask a foundation model to summarize the chunk (or ask a question about it) and then generate keyterms for it. Those keyterms are stored along with the vector in the database. During search, I use the user's input to do a vector search for matching chunks, then pull in their keyterms. Using those keyterms, I do set operations to find related chunks. I then run a vector search against these related chunks and add the results to the top matches from the first vector search to assemble new prompt text.
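To make the chunking and the keyterm set operations concrete, here is a rough Python sketch. It is only an illustration, not the actual implementation: the function names, the regex sentence splitter, and the in-memory dictionaries are all assumptions, and the embedding, vector search, and keyterm generation steps are left out entirely (keyterms are passed in as plain sets).

```python
import re
from collections import defaultdict

def chunk_sentences(text, min_chars=512):
    """Group whole sentences into chunks that close once they pass min_chars."""
    # Crude splitter (assumption): break on whitespace that follows . ! or ?
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        current = f"{current} {sentence}".strip()
        if len(current) > min_chars:  # "just over" the limit, never mid-sentence
            chunks.append(current)
            current = ""
    if current:  # keep the trailing, shorter chunk
        chunks.append(current)
    return chunks

def build_keyterm_index(chunk_keyterms):
    """Invert {chunk_id: keyterms} into {keyterm: chunk_ids}, like a book index."""
    index = defaultdict(set)
    for chunk_id, terms in chunk_keyterms.items():
        for term in terms:
            index[term].add(chunk_id)
    return index

def related_chunks(hit_ids, chunk_keyterms, index):
    """Chunks that share at least one keyterm with the vector-search hits."""
    terms = set().union(*(chunk_keyterms[i] for i in hit_ids))
    candidates = set().union(*(index.get(t, set()) for t in terms))
    return candidates - set(hit_ids)  # drop the hits we already have
```

Here `hit_ids` stands for whatever chunk ids the first vector search returned. The result of `related_chunks` would be the candidate set for the second vector search.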
This strategy is based on the idea of a "back of the book index". It is entirely plausible to also look for "outliers" in the keyterms and throw the chunks carrying those keyterms into the prompt, to see if that nets us some understanding of "hidden" meaning in the document.
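One way to read "outliers" is keyterms that appear in very few chunks. The sketch below takes that interpretation, which is an assumption. Other definitions (for example, distance in embedding space) would work too. The function names and the `max_chunks` threshold are hypothetical.

```python
from collections import Counter

def outlier_keyterms(chunk_keyterms, max_chunks=1):
    """Keyterms that occur in at most max_chunks chunks (assumed meaning of 'outlier')."""
    counts = Counter(
        term for terms in chunk_keyterms.values() for term in set(terms)
    )
    return {term for term, n in counts.items() if n <= max_chunks}

def chunks_with_terms(chunk_keyterms, terms):
    """Ids of chunks that carry at least one of the given keyterms."""
    return {cid for cid, kt in chunk_keyterms.items() if terms & set(kt)}
```

The chunks returned by `chunks_with_terms` would be the extra candidates to consider adding to the prompt.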
There is also a means to continue doing the "keyterm" extraction trick as the system is used. Keyterms from answers as well as user prompts may be added to the existing index over time, helping improve the ability to return low-frequency information that may be initially hidden.
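A minimal sketch of growing the index during use might look like the following. Two things here are assumptions rather than anything stated above: new keyterms are attached to the chunks that were used to build the answer, and `naive_keyterms` is only a toy stand-in for the model call that would really extract keyterms.

```python
def extend_index(index, chunk_ids, texts, extract_keyterms):
    """Add keyterms found in texts to index, pointing at chunk_ids.

    index: dict mapping keyterm -> set of chunk ids (updated in place).
    extract_keyterms: callable taking a string and returning keyterms.
    """
    for text in texts:
        for term in extract_keyterms(text):
            index.setdefault(term, set()).update(chunk_ids)
    return index

def naive_keyterms(text):
    # Toy extractor (assumption): lowercase words longer than 5 characters.
    words = (w.strip(".,?!").lower() for w in text.split())
    return {w for w in words if len(w) > 5}
```

For example, after answering a prompt from chunks 1 and 4, calling `extend_index(index, [1, 4], [prompt, answer], naive_keyterms)` would make those chunks reachable through the new terms on later searches.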