Hacker News
by newpeak 747 days ago
Tagging each paragraph is not a good approach. Instead, using an LLM to generate a summary for each cluster of paragraphs could be a good alternative. That's what RAPTOR (https://arxiv.org/html/2401.18059v1) suggests.
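
To make the cluster-then-summarize idea concrete, here is a rough single-level sketch. It is not RAPTOR's actual pipeline: as far as I understand the paper, it clusters with Gaussian mixture models over reduced embeddings and repeats the process recursively to build a tree, whereas this sketch swaps in a simple greedy cosine-threshold clustering for brevity. The names `greedy_cluster`, `summarize_clusters`, the `threshold` parameter, and the `summarize` callable (a stand-in for an LLM call) are all hypothetical.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def greedy_cluster(
    embeddings: Sequence[Sequence[float]], threshold: float = 0.8
) -> List[List[int]]:
    """Assign each paragraph to the first cluster whose seed is similar enough.

    This is a simplification standing in for a real clustering algorithm.
    """
    clusters: List[List[int]] = []
    for i, emb in enumerate(embeddings):
        for cluster in clusters:
            # Compare against the cluster's first member (its seed).
            if cosine(emb, embeddings[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no cluster matched: start a new one
    return clusters


def summarize_clusters(
    paragraphs: Sequence[str],
    embeddings: Sequence[Sequence[float]],
    summarize: Callable[[List[str]], str],
    threshold: float = 0.8,
) -> List[str]:
    """Produce one summary per cluster; `summarize` is a placeholder for an LLM."""
    return [
        summarize([paragraphs[i] for i in cluster])
        for cluster in greedy_cluster(embeddings, threshold)
    ]
```

The summaries would then be embedded and indexed alongside (or instead of) the raw paragraphs, so a query can match at the cluster level without any per-paragraph tags.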

As for the reranker, it's a purely dynamic solution applied at query time, unlike re-chunking, which requires all of the data to be reindexed.

Comparison of two paragraphs is, in most cases, based on embeddings: each paragraph corresponds to a single embedding, so the comparison does not take individual words into account. However, if you adopt hybrid search, which uses full-text search as an additional recall path, words do factor into the ranking. In that case, the scores are computed from TF-IDF metrics within the paragraph, as an accumulated score over all matched tokens.
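
A minimal sketch of the two signals, to show the difference. The tokenizer (lowercase, whitespace split) and the smoothed IDF formula are assumptions picked for illustration; real full-text engines typically use BM25 or some other TF-IDF variant, and the function names here are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence


def tokenize(text: str) -> List[str]:
    """Naive tokenizer: lowercase and split on whitespace (an assumption)."""
    return text.lower().split()


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Embedding-side comparison: one vector per paragraph, no word-level view."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def tfidf_score(query: str, paragraph: str, corpus: Sequence[str]) -> float:
    """Full-text side: accumulate TF * IDF over every query token that hits."""
    n_docs = len(corpus)
    doc_token_sets = [set(tokenize(doc)) for doc in corpus]
    term_freq = Counter(tokenize(paragraph))

    score = 0.0
    for token in set(tokenize(query)):
        tf = term_freq[token]
        if tf == 0:
            continue  # token does not occur in the paragraph: no contribution
        df = sum(1 for tokens in doc_token_sets if token in tokens)
        # Smoothed IDF; one of several common variants.
        idf = math.log((n_docs + 1) / (df + 1)) + 1.0
        score += tf * idf
    return score
```

In a hybrid setup, both scores are computed for the candidate paragraphs and then fused into one ranking; how they are fused (weighted sum, rank fusion, and so on) is a separate design choice that this sketch leaves out.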