	by CharlieDigital 702 days ago
	The easiest solution to this is to stuff the heading into the chunk. The heading is hierarchical navigation within the sections of the document. I found Azure Document Intelligence specifically with the Layout Model to be fantastic for this because it can identify headers. All the better if you write a parser for the output JSON to track depth and stuff multiple headers from the path into the chunk.
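A minimal sketch of the depth-tracking idea, assuming the layout output has already been flattened into `(kind, depth, text)` tuples. That tuple shape and the function name `chunks_with_heading_path` are hypothetical and chosen for illustration; they are not the actual Azure Document Intelligence JSON schema.

```python
from typing import Iterable


def chunks_with_heading_path(items: Iterable[tuple[str, int, str]]) -> list[str]:
    """Prefix each paragraph with the path of headings above it.

    `items` is an assumed, simplified shape: (kind, depth, text), where kind is
    "heading" or "paragraph" and depth is the 1-based heading level
    (ignored for paragraphs).
    """
    path: list[str] = []
    out: list[str] = []
    for kind, depth, text in items:
        if kind == "heading":
            # Drop headings at the same or a deeper level, then push this one.
            del path[depth - 1:]
            path.append(text)
        else:
            prefix = " > ".join(path)
            out.append(f"{prefix}\n{text}" if prefix else text)
    return out


items = [
    ("heading", 1, "Protocol"),
    ("heading", 2, "Dosing"),
    ("paragraph", 0, "Take 5 mg daily."),
    ("heading", 2, "Safety"),
    ("paragraph", 0, "Report adverse events."),
]
stuffed = chunks_with_heading_path(items)
```

Each resulting chunk carries its full heading path, so the embedding sees "Protocol > Dosing" even when the paragraph itself never mentions either word.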

3 comments

lmeyerov 702 days ago

So subtle! The article is about doing exactly that, which is something we are working on a lot right now... though it seems to snatch defeat from the jaws of victory:

If we think about what this is about, it is basically entity augmentation & lexical linking / citations.

Ex: A patient document may be all about patient id 123. That won't be spelled out in every paragraph, but by carrying along the patient ID (semantic entity) and the document (citation), the combined model gets access to them. A naive one-shot retrieval over a naive chunked vector index would want it in the text/embedding, while a smarter one would also keep it in the entry metadata. And as others write, this helps move reasoning from the symbolic domain to the semantic domain, so it is less of a hack.
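One way to picture carrying the entity and the citation along with each chunk is sketched below. The `IndexEntry` class, the `make_entry` helper, and the sample document name are all hypothetical; the sketch only shows the entity appearing both in the text to embed and in the entry metadata.

```python
from dataclasses import dataclass, field


@dataclass
class IndexEntry:
    embed_text: str                      # what gets embedded
    metadata: dict[str, str] = field(default_factory=dict)  # what gets filtered/cited


def make_entry(chunk: str, doc_id: str, entities: dict[str, str]) -> IndexEntry:
    """Attach entities to the embedded text and to the metadata (illustrative only)."""
    tags = "; ".join(f"{key}={value}" for key, value in sorted(entities.items()))
    embed_text = f"[{tags}]\n{chunk}" if tags else chunk
    return IndexEntry(embed_text=embed_text, metadata={"doc_id": doc_id, **entities})


entry = make_entry(
    "Blood pressure was stable.",
    doc_id="visit-notes.pdf",            # hypothetical document name
    entities={"patient_id": "123"},
)
```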

We are working on some fun 'pure-vector' graph RAG work here to tackle production problems around scale, quality, & always-on scenarios like alerting - happy to chat!

CharlieDigital 702 days ago

Also working with GRAG (via Neo4j), and I'm somewhat skeptical that, in most cases where a natural hierarchical structure already exists, graph will significantly outperform RAG that uses the hierarchical structure.

A better solution I had thought about is "local RAG". I came across this while processing embeddings from chunks parsed from Azure Document Intelligence JSON. The realization is that relevant topics are often localized within a document. Even across a corpus of documents, relevant passages are localized.

Because the chunks are processed sequentially, one needs only to keep track of the sequence number of each chunk. Assume that the embedding matches chunk n; it then follows that the most important context is the chunks localized between n - m and n + p. So find the top x chunks via hybrid embedding + full-text match and expand outwards from each of them to grab the chunks around it.

While a chunk may represent just a few sentences of a larger block of text, this strategy will grab possibly the whole section or page of text localized around the chunk with the highest match.
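The expansion step can be sketched as follows, assuming chunks from one document are numbered 0 to total - 1 in reading order and the top hits have already been found by some search. The name `expand_hits` and the default window sizes are illustrative choices, not values from the thread.

```python
def expand_hits(hits: list[int], total: int, before: int = 2, after: int = 2) -> list[int]:
    """Return sorted, de-duplicated chunk indices n - before .. n + after per hit."""
    selected: set[int] = set()
    for n in hits:
        first = max(0, n - before)           # clamp at the start of the document
        last = min(total - 1, n + after)     # clamp at the end of the document
        selected.update(range(first, last + 1))
    return sorted(selected)


window = expand_hits([5], total=10, before=2, after=1)
```

Overlapping windows from neighboring hits merge naturally because the indices are collected in a set before sorting.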

michalwarda 702 days ago

This works as long as the relevant information is colocated. Sometimes though, for example in financial documents, important parts reference each other through keywords etc. That's why you can always try to retrieve not only positionally related chunks but also semantically related ones.

Go for chunk n, n - m, n + p and n' where n' are closest chunks to n semantically.
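A rough sketch of that selection, assuming each chunk already has an embedding stored in reading order. The function names and the toy two-dimensional vectors are made up for illustration; a real index would use its own nearest-neighbor search rather than this brute-force loop.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def related_chunks(n: int, vectors: list[list[float]], m: int = 1, p: int = 1, k: int = 2) -> list[int]:
    """Indices n - m .. n + p plus the k chunks outside that window closest to chunk n."""
    last = len(vectors) - 1
    positional = set(range(max(0, n - m), min(last, n + p) + 1))
    scored = sorted(
        ((cosine(vectors[n], v), i) for i, v in enumerate(vectors) if i not in positional),
        reverse=True,
    )
    semantic = {i for _, i in scored[:k]}
    return sorted(positional | semantic)


# Toy embeddings: chunks 2 and 4 point in nearly the same direction as chunk 0.
vectors = [[1, 0], [0, 1], [0.9, 0.1], [0, 1], [1, 0.05], [0, 1]]
picked = related_chunks(0, vectors, m=1, p=1, k=1)
```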

Moreover you can give this traversal possibility to your LLM to use itself as a tool or w/e whenever it is missing crucial information to answer the question. Thanks to that you don't always retrieve thousands of tokens even when not needed.

CharlieDigital 702 days ago

    > positionally related chunks but also semantically related ones

That's why the entry point would still be an embedding search; it's just that instead of using the first 20 embedding hits, you take the first 5 and if the reference is "semantically adjacent" to the entry concept, we would expect that some of the first few chunks would capture it in most cases.

I think where GRAG yields more relevancy is when the referenced content is not semantically similar nor even semantically adjacent to the entry concept but is semantically similar to some sub-fragment of a matched chunk. Depending on the corpus, this can either be common (I have no familiarity with financial documents) or rare. I've primarily worked with clinical trial protocols and at least in that space, the concepts are what I would consider "snowflake-shaped" in that they branch out pretty cleanly and rarely cross-reference (because it is more common that they repeat the relevant reference).

All that said, I think that as a matter of practicality, most teams will probably get much bigger yield with much less effort doing local expansion based on matching for semantic similarity first since it addresses two core problems with embeddings (text chunk size vs embedding accuracy, relevancy of embeddings matched below a given threshold). Experiment with GRAG depending on the type of questions you're trying to answer and the nature of the underlying content. Don't get me wrong; I'm not saying GRAG has no benefit, but that most teams can explore other ways of using RAG before trying GRAG.

lmeyerov 702 days ago

Neo4j graph rag is typically not graph rag in the AI sense / MSR Graph RAG paper sense, but KG or lexical extraction & embedding, plus some retrieval-time hope that the neighborhood is OK.

GRAG in the direction of the MSR paper adds some important areas:

- summary indexes that can be lexical (document hierarchy) or not (topic, patient ID, etc), esp via careful entity extraction & linking

- domain-optimized summarization templates, both automated & manual

- + as mentioned, wider context around these at retrieval

- introducing a more generalized framework for handling different kinds of concept relations, summary indexing, and retrieval around these

Ex: The same patient over time & docs, and separately, similar kinds of patients across documents

Note that I'm not actually a big fan of how the MSR paper indirects the work through KG extraction, as that exits the semantic domain, and we don't do it that way

Fundamentally, that both moves away from paltry retrieval result sets that are small/gaps/etc, and enables cleaner input to the runtime query

I agree it is a quick win if quality can be low and you have low budget/time. Like combine a few out of the box index types and do rank retrieval. But a lot of the power gets lost. We are working on infra (+ OSSing it) because that is an unfortunate and unnecessary state of affairs. Right now llamaindex/langchain and raw vector DBs feel like ad hoc and unprincipled ways to build these pipelines from a software engineering and AI perspective, so from an investment side, moving away from hacks and to more semantic, composable, & scalable pipelines is important IMO.

CharlieDigital 702 days ago

    > Neo4j graph rag is typically not graph rag

I would mildly disagree with this; Neo4j just serves as an underlying storage mechanism much like Postgres+pgvector could be the underlying storage mechanism for embedding-only RAG. How one extracts entities and connects them in the graph happens a layer above the storage layer of Neo4j (though Neo4j can also do this internally). Neo4j is not magic; the application layer and data modelling still have to define which entities exist and how they are connected.

But why Neo4j? Neo4j has some nice amenities for building GRAG on top of. In particular, it has packages to support community partitioning including Leiden[0] (also used by Microsoft's GraphRAG[1]) and Louvain[2] as well as several other community detection algorithms. The built-in support for node embeddings[3] as well as external AI APIs[4] make the DX -- in so far as building the underlying storage for complex retrieval -- quite good, IMO.

The approach that we are taking is that we are importing a corpus of information into Neo4j and performing ETL on the way in to create additional relationships; effectively connecting individual chunks by some related "facet". Then we plan to run community detection over it to identify communities of interest and use a combination of communities, locality, and embedding match to retrieve chunks.
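To make the "connect chunks by a shared facet, then group them" idea concrete, here is a toy sketch in plain Python rather than Neo4j. It swaps in connected components (via union-find) as a much cruder stand-in for Leiden or Louvain community detection, and the facet strings and the name `facet_groups` are invented for illustration.

```python
from collections import defaultdict


def facet_groups(chunk_facets: dict[str, set[str]]) -> list[set[str]]:
    """Group chunks that are linked, directly or transitively, by a shared facet.

    Connected components only; a real pipeline would run proper community
    detection (e.g. Leiden or Louvain) over a weighted graph instead.
    """
    by_facet: dict[str, list[str]] = defaultdict(list)
    for chunk_id, facets in chunk_facets.items():
        for facet in facets:
            by_facet[facet].append(chunk_id)

    parent = {chunk_id: chunk_id for chunk_id in chunk_facets}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    # Every pair of chunks sharing a facet ends up under the same root.
    for members in by_facet.values():
        root = find(members[0])
        for other in members[1:]:
            parent[find(other)] = root

    groups: dict[str, set[str]] = defaultdict(set)
    for chunk_id in chunk_facets:
        groups[find(chunk_id)].add(chunk_id)
    return sorted(groups.values(), key=sorted)


groups = facet_groups({
    "c1": {"drug:aspirin"},
    "c2": {"drug:aspirin", "site:boston"},
    "c3": {"site:boston"},
    "c4": {"drug:ibuprofen"},
})
```

Here c1 and c3 share no facet directly but land in the same group through c2, which is the kind of indirect link a pure embedding match would not surface.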

I just started exploring it over the past week and I would say that if your team is going to end up doing some more complex GRAG, then Neo4j feels like it has the right tooling to be the underlying storage layer and you could even feasibly implement other parts of your workflow in there as well, but entity extraction and such feels like it belongs one layer up in the application layer. Most notably, having direct query access to the underlying graph with a graph query language (Cypher) means that you will have more control and different ways to experiment with retrieval. However, as I mentioned, I would encourage most teams to be more clever with embedding RAG before adding more infrastructure like Neo4j.

[0] https://neo4j.com/docs/graph-data-science/current/algorithms...

[1] https://microsoft.github.io/graphrag/

[2] https://neo4j.com/docs/graph-data-science/current/algorithms...

[3] https://neo4j.com/docs/graph-data-science/current/machine-le...

[4] https://neo4j.com/labs/apoc/5/ml/openai/

ec109685 702 days ago

Would it be better to go all the way and completely rewrite the source material in a way more suitable for retrieval? To some extent these headers are a step in that direction, but you’re still at the mercy of the chunk of text being suitable to answer the question.

Instead, completely transforming the text into a dense set of denormalized “notes” that cover every concept present in the text seems like it would be easier to mine for answers to user questions.

Essentially, it would be like taking comprehensive notes from a book and handing them to a friend who didn’t take the class for a test. What would they need to be effective?

Longer term, the sequence would likely be “get question”, hand it to research assistant who has full access to source material and can run a variety of AI / retrieval strategies to customize the notes, and then hand those notes back for answers. By spending more time on the note gathering step, it will be more likely the LLM will be able to answer the question.

CharlieDigital 702 days ago

For a large corpus, this would be quite expensive in terms of time and storage space. My experience is that embeddings work pretty well around 144-160 tokens (pure trial and error) with clinical trial protocols. I am certain that this value will be different by domain and document types.

If you generate and then "stuff" more text into this, my hunch is that accuracy drops off as the token count increases and it becomes "muddy". GRAG or even normal RAG can solve this to an extent because -- as you propose -- you can generate a congruent "note" and then embed that and link them together.

I'd propose something more flexible: expand the input query instead, basically multiplexing it to the related topics and ideas, and perform a cheap embedding search using more than 1 input vector.
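A small sketch of searching with more than one input vector, assuming the expanded queries have already been produced and embedded elsewhere (for example by an LLM plus an embedding model). The function names, the max-score merge rule, and the toy three-dimensional vectors are assumptions made for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def multi_query_search(
    query_vectors: list[list[float]],
    chunk_vectors: dict[str, list[float]],
    top_k: int = 3,
) -> list[str]:
    """Score each chunk by its best match against any query vector."""
    best = {
        chunk_id: max(cosine(q, vec) for q in query_vectors)
        for chunk_id, vec in chunk_vectors.items()
    }
    ranked = sorted(best, key=lambda chunk_id: best[chunk_id], reverse=True)
    return ranked[:top_k]


chunks = {"a": [1, 0, 0], "b": [0, 1, 0], "c": [0, 0, 1]}
# Original query plus one expanded query pointing mostly toward chunk "c".
results = multi_query_search([[1, 0, 0], [0, 0.6, 0.8]], chunks, top_k=2)
```

Taking the maximum over query vectors means a chunk only needs to match one of the expanded topics well; averaging would be another reasonable merge rule.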

williamcotton 702 days ago

    > Contextual chunk headers
    >
    > The idea here is to add in higher-level context to the chunk by prepending a chunk header. This chunk header could be as simple as just the document title, or it could use a combination of document title, a concise document summary, and the full hierarchy of section and sub-section titles.

That is from the article. Is this different from your suggested approach?
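For reference, the header described in the quoted passage could be assembled roughly like this. The "Document:", "Summary:" and "Section:" labels and both function names are arbitrary illustrative choices, not a format taken from the article.

```python
def contextual_header(title: str, section_path: list[str], summary: str = "") -> str:
    """Build a chunk header from document title, optional summary, and section hierarchy."""
    lines = [f"Document: {title}"]
    if summary:
        lines.append(f"Summary: {summary}")
    if section_path:
        lines.append("Section: " + " > ".join(section_path))
    return "\n".join(lines)


def with_header(chunk: str, header: str) -> str:
    """Prepend the header so it is embedded together with the chunk text."""
    return f"{header}\n\n{chunk}"


header = contextual_header("Trial Protocol", ["Dosing", "Adults"], summary="Phase II study.")
chunk_text = with_header("Take 5 mg daily.", header)
```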

CharlieDigital 702 days ago

No, but this is also not really a novel solution.
