by semihsalihoglu 892 days ago

This post summarizes some reading I have done on LLMs + knowledge graphs, with the goal of identifying technically deep and interesting directions. It covers retrieval-augmented generation (RAG) systems that use unstructured data (RAG-U) and the role people envision knowledge graphs playing in them. Briefly, the design space of RAG-U systems has two dimensions: 1) what additional data to put into LLM prompts, such as documents or triples extracted from documents; 2) how to store and fetch that data, such as a vector index, a graph DBMS (GDBMS), or both.

Standard RAG-U uses vector embeddings of chunks, which are fetched from a vector index. One envisioned role of knowledge graphs is to improve standard RAG-U by explicitly linking the chunks through the entities they mention. This is a promising idea, but one that needs to be subjected to rigorous evaluation, as is done in prominent IR venues such as SIGIR.
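To make the chunk-linking idea concrete, here is a minimal Python sketch. It is not the method from the post, just one way the idea could look. It assumes the entities mentioned in each chunk have already been extracted (the chunk IDs and entity names are made up for illustration), and it expands a set of vector-retrieved chunks with any other chunks that share an entity.

```python
from collections import defaultdict

# Hypothetical data: chunk ID -> entities mentioned in that chunk.
# In practice these would come from an entity-extraction step.
chunk_entities = {
    "c1": {"Alice", "Acme Corp"},
    "c2": {"Acme Corp", "Bob"},
    "c3": {"Carol"},
}

def build_entity_index(chunk_entities):
    """Invert the mapping: entity -> set of chunk IDs that mention it."""
    index = defaultdict(set)
    for chunk_id, entities in chunk_entities.items():
        for entity in entities:
            index[entity].add(chunk_id)
    return index

def expand_with_linked_chunks(seed_chunks, chunk_entities, entity_index):
    """Add every chunk that shares at least one entity with a seed chunk."""
    expanded = set(seed_chunks)
    for chunk_id in seed_chunks:
        for entity in chunk_entities.get(chunk_id, ()):
            expanded |= entity_index[entity]
    return expanded

entity_index = build_entity_index(chunk_entities)
# Pretend the vector index returned only "c1" for some query.
linked = expand_with_linked_chunks({"c1"}, chunk_entities, entity_index)
```

Here "c2" is pulled in because it shares an entity with "c1", while "c3" stays out. Whether this one-hop expansion actually improves answers is exactly the kind of question that needs careful evaluation.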

The post then discusses the scenario in which an enterprise does not have a knowledge graph, and the idea of automatically extracting knowledge graphs from unstructured PDFs and text documents. It covers recent work that uses LLMs for this task (they are not yet competitive with specialized models) and highlights many interesting open questions.

I hope this is useful to people who are interested in the area but intimidated by the flood of activity (don't be; I think the area is easier to digest than it may look).

3 comments

kordlessagain 890 days ago

Knowledge graphs improve vector search by providing a "back of the book" index for the content. This can be done using knowledge extraction from an LLM during indexing, such as pulling out keyterms of a given chunk before embedding, or asking a question of the content and then answering it using the keyterms in addition to the embeddings. One challenge I found with this is determining keyterms to use with prompts that have light context, but using a time window helps with this, as does hitting the vector store for related content, then finding the keyterms for THAT content to use with the current query.
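A rough Python sketch of the "back of the book" index described above. The keyterm extractor is a naive word-frequency stand-in for the LLM call, and the vector store is assumed to exist elsewhere: `expand_query_terms` just takes the IDs of related chunks it would have returned. All function names and sample chunks are hypothetical.

```python
import re
from collections import Counter, defaultdict

STOPWORDS = {"a", "an", "and", "as", "for", "in", "is", "of", "on",
             "the", "to", "what", "with"}

def extract_keyterms(text, top_k=3):
    """Naive stand-in for LLM keyterm extraction: most frequent non-stopwords."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]
    return {term for term, _ in Counter(words).most_common(top_k)}

def build_keyterm_index(chunks):
    """Build the index at indexing time: keyterm -> set of chunk IDs."""
    index = defaultdict(set)
    for chunk_id, text in chunks.items():
        for term in extract_keyterms(text):
            index[term].add(chunk_id)
    return index

def lookup(query, index):
    """Return chunks whose keyterms match any word in the query."""
    hits = set()
    for word in re.findall(r"[a-z]+", query.lower()):
        hits |= index.get(word, set())
    return hits

def expand_query_terms(related_chunk_ids, chunks):
    """For light-context prompts: borrow keyterms from related chunks
    (e.g., the ones a vector store returned) to use with the query."""
    terms = set()
    for chunk_id in related_chunk_ids:
        terms |= extract_keyterms(chunks[chunk_id])
    return terms

chunks = {
    "a": "graph databases store graph data as nodes and edges",
    "b": "a vector index stores vector embeddings of chunks",
}
index = build_keyterm_index(chunks)
hits = lookup("what is a graph", index)
```

In a real setup the keyterm lookup would be combined with the embedding search rather than replace it.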

sroussey 890 days ago

What open source model is good at pulling keyterms?

laminarflow027 890 days ago

OpenNRE (https://github.com/thunlp/OpenNRE) is another good approach to neural relation extraction, though it's slightly dated. What would be particularly interesting is to combine models like OpenNRE or SpanMarker with entity-linking models to construct KG triples. And a solid, scalable graph database underneath would make for a great knowledge base that can be constructed from unstructured text.
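As a sketch of how the pieces could fit together, the Python below takes relation-extraction output expressed as raw text mentions and maps the mentions to canonical entity IDs before emitting triples. It does not call OpenNRE or SpanMarker; the extracted tuples, the alias table standing in for an entity-linking model, and the `ent:` IDs are all made up for illustration.

```python
# Hypothetical relation-extraction output: (head mention, relation, tail mention).
extracted = [
    ("ACME", "acquired", "Globex"),
    ("Acme Corp.", "acquired", "Globex Inc"),
    ("Acme Corp.", "based_in", "Springfield"),
]

# Toy stand-in for an entity-linking model: surface form -> canonical ID.
alias_to_entity = {
    "acme": "ent:acme",
    "acme corp.": "ent:acme",
    "globex": "ent:globex",
    "globex inc": "ent:globex",
}

def link(mention):
    """Resolve a mention to a canonical ID; fall back to the raw mention."""
    return alias_to_entity.get(mention.strip().lower(), mention)

def build_triples(extracted):
    """Link both ends of each relation; the set removes duplicate triples."""
    return {(link(head), rel, link(tail)) for head, rel, tail in extracted}

triples = build_triples(extracted)
```

The first two extractions collapse into a single triple once their mentions are linked, which is the main reason to put entity linking in front of the graph database.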

sroussey 889 days ago

Nice, I’ll look that up.

I was thinking in terms of RAG and turning text into keywords. Any thoughts there?

laminarflow027 889 days ago

By this I presume you mean building a search index that can retrieve results based on keywords? I know certain databases use Lucene to build a keyword-based index on top of unstructured blobs of data. An alternative is Tantivy (https://github.com/quickwit-oss/tantivy), a Lucene-inspired search library written in Rust, if building search indices via Java isn't your cup of tea :)

Both libraries offer multilingual support for keywords, I believe, which is a benefit over vector search, where multilingual embedding models are rather expensive.

semihsalihoglu 890 days ago

For entity extraction you can look at SpanMarker: https://tomaarsen.github.io/SpanMarkerNER/. I'm sure other tools exist, and others can hopefully point to more.

daxfohl 890 days ago

Having just started from zero, I agree on the easy-to-digest point. You can get a pretty good understanding of how most things work in a couple of days, and the field is moving so fast that a lot of papers are just exploring different iterative improvements on basic concepts.

mark_l_watson 890 days ago

I really liked the idea of creating linked data to connect chunks. That is an idea that deserves some play time (I just added it to my TODO list). Thanks for the good ideas!
