|
|
|
|
|
by CharlieDigital
702 days ago
|
|
> positionally related chunks but also semantically related ones
That's why the entry point would still be an embedding search; it's just that instead of using the first 20 embedding hits, you take the first 5 and if the reference is "semantically adjacent" to the entry concept, we would expect that some of the first few chunks would capture it in most cases.I think where GRAG yields more relevancy is when the referenced content is not semantically similar nor even semantically adjacent to the entry concept but is semantically similar to some sub fragment of a matched chunk. Depending on the corpus, this can either be common (no familiarity with financial documents) or rare. I've primarily worked with clinical trial protocols and at least in that space, the concepts are what I would consider "snowflake-shaped" in that it branches out pretty cleanly and rarely cross-references (because it is more common that it repeats the relevant reference). All that said, I think that as a matter of practicality, most teams will probably get much bigger yield with much less effort doing local expansion based on matching for semantic similarity first since it addresses two core problems with embeddings (text chunk size vs embedding accuracy, relevancy or embeddings matched below a given threshold). Experiment with GRAG depending on the type of questions you're trying to answer and the nature of the underlying content. Don't get me wrong; I'm not saying GRAG has no benefit, but that most teams can explore other ways of using RAG before trying GRAG. |
|
GRAG in the direction of the MSR paper adds some important areas:
- summary indexes that can be lexical (document hierarchy) or not (topic, patient ID, etc), esp via careful entity extraction & linking
- domain-optimized summarization templates, both automated & manual
- + as mentioned, wider context around these at retrieval
- introducing a more generalized framework for handling different kinds of concept relations, summary indexing, and retrieval around these
Ex: The same patient over time & docz, and seperately, similar kinds of patients across documents
Note that I'm not actually a big fan of how the MSR paper indirects the work through KG extraction, as that exits the semantic domain, and we don't do it that way
Fundamentally, that both moves away from paltry retrieval result sets that are small/gaps/etc, and enables cleaner input to the runtime query
I agree it is a quick win if quality can be low and you have low budget/time. Like combine a few out of the box index types and do rank retrieval. But a lot of the power gets lost. We are working on infra (+ OSSing it) because that is an unfortunate and unnecessary state of affairs. Right now llamaindex/langchain and raw vector DBs feel like adhoc and unprincipled ways to build these pipelines in a software engineering and AI perspective, so from an investment side, moving away from hacks and to more semantic, composable, & scalable pipelines is important IMO.