by CharlieDigital
702 days ago
The easiest solution to this is to stuff the heading into the chunk. Headings act as hierarchical navigation within the sections of the document, so they tell you where a chunk sits. I found Azure Document Intelligence, specifically with the Layout model, to be fantastic for this because it can identify headers. All the better if you write a parser for the output JSON to track depth and stuff multiple headers from the path into the chunk.
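A minimal sketch of the "track depth and stuff the header path into the chunk" idea. It assumes the layout output has already been flattened into a list of dicts with hypothetical `kind`, `level`, and `text` keys; that is not the actual Azure Document Intelligence schema, so the mapping step from the real JSON is left out. The names `Chunk` and `chunk_with_heading_path` are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    heading_path: list[str]  # headings from the document root down to this chunk
    body: str

    def text_for_embedding(self) -> str:
        # Prefix the body with the full heading path so the embedding sees it.
        if not self.heading_path:
            return self.body
        return " > ".join(self.heading_path) + "\n\n" + self.body


def chunk_with_heading_path(items: list[dict]) -> list[Chunk]:
    """Walk items in reading order, keeping a stack of the open headings.

    Each item is assumed to look like (hypothetical shape):
      {"kind": "heading", "level": 2, "text": "Install"}
      {"kind": "paragraph", "text": "Run the installer."}
    """
    stack: list[tuple[int, str]] = []  # (level, heading text)
    chunks: list[Chunk] = []
    for item in items:
        if item["kind"] == "heading":
            level = item["level"]
            # A new heading closes every open heading at the same or deeper level.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, item["text"]))
        else:
            chunks.append(Chunk([text for _, text in stack], item["text"]))
    return chunks


items = [
    {"kind": "heading", "level": 1, "text": "Guide"},
    {"kind": "heading", "level": 2, "text": "Install"},
    {"kind": "paragraph", "text": "Run the installer."},
    {"kind": "heading", "level": 2, "text": "Usage"},
    {"kind": "paragraph", "text": "Call run()."},
]
chunks = chunk_with_heading_path(items)
print(chunks[0].text_for_embedding())
```

The stack is what makes this hierarchical: a paragraph under "Usage" carries "Guide > Usage" rather than only its nearest heading, and sibling sections do not leak into each other.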
|
If we think about what this really is, it is basically entity augmentation plus lexical linking / citations.
Ex: A patient document may be all about patient ID 123. That won't be spelled out in every paragraph, but by carrying along the patient ID (semantic entity) and the document (citation) with each chunk, the combined model gets access to them. A naive one-shot retrieval over a naively chunked vector index would want them in the text/embedding, while a smarter one would also want them in the entry metadata. And as others write, this helps move reasoning from the symbolic domain to the semantic domain, so it is less of a hack.
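A rough sketch of the patient ID example, assuming a generic index entry with a text field (what gets embedded) and a metadata field (what gets stored alongside it). `IndexEntry`, `augment`, and the file name are hypothetical and not tied to any particular vector store.

```python
from dataclasses import dataclass


@dataclass
class IndexEntry:
    text: str       # string that would be embedded
    metadata: dict  # structured fields stored next to the vector


def augment(paragraph: str, entities: dict[str, str], source: str) -> IndexEntry:
    """Carry document-level entities and the citation along with one paragraph."""
    prefix = "; ".join(f"{key}: {value}" for key, value in entities.items())
    # Entities go into the embedded text so naive one-shot retrieval can match them.
    text = f"[{prefix}] {paragraph}" if prefix else paragraph
    # They also go into metadata, next to the citation, for filtering and linking.
    return IndexEntry(text=text, metadata={**entities, "source": source})


entry = augment(
    "Blood pressure was normal.",
    {"patient_id": "123"},
    "visit-notes.pdf",
)
print(entry.text)
print(entry.metadata)
```

Putting the entity in both places is the point: the text copy helps plain similarity search, while the metadata copy lets a smarter retriever filter or join on it exactly.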
We are working on some fun 'pure-vector' graph RAG work here to tackle production problems around scale, quality, & always-on scenarios like alerting - happy to chat!