Show HN: Demystifying Advanced RAG Pipelines (github.com)
131 points by pchunduri6 970 days ago
I've built an advanced RAG (Retrieval-Augmented Generation) pipeline from scratch to demystify the complex mechanics of modern LLM-powered Question Answering systems. This repository features:

-- An implementation of a sub-question query engine from scratch to answer complex user questions.

-- Illustrative explanations that unveil the inner workings of the system.

-- An analysis of the challenges I faced while working with the system, like prompt engineering and cost estimation.

-- Qualitative comparison with similar frameworks like LlamaIndex, offering a broader perspective.

Key Takeaway: While modern QA pipelines with advanced RAG abstractions may seem complex, they are fundamentally powered by a series of LLM calls with meticulous prompt design. I hope this repository provides intuitive insights for building more robust and efficient RAG systems. All feedback is warmly welcomed!

6 comments

This is a great README! It clearly breaks down some approaches to RAG. I also appreciate how you strive to de-mystify what’s going on under the hood, which is in many ways VERY simple.

This seems very similar to LangSmith’s trace monitoring, which I have been leaning on heavily for observability. You also mention LlamaIndex— how do you see your project fitting into the ecosystem?

I don’t think I would be able to use this yet because it is serial. Is it possible to non-serially issue independent sub-question queries?

In my experimental agent system, waggledance.ai[1], I have been working on a pre-agent step of picking and synthesizing the right context and tools[2] for a given subtask of a larger goal, and it seems to be boosting results. It looks like now I have to try sub-question answering in the mix as well.

[1] demo - https://waggledance.ai

[2] relevant code sample - https://github.com/agi-merge/waggle-dance/blob/1b14163c24fd2...

Thanks for the kind words and the great questions!

-- LlamaIndex has some excellent abstractions. In fact, I started off this project with LlamaIndex using their sub-question query engine. However, I found that the abstractions often obfuscate the prompt templates and the pipeline itself from the user. I found that writing my own pipeline was easier than trying to figure out how to engineer the prompts that LlamaIndex was using.

-- It is possible to non-serially issue independent sub-question queries (e.g., using async io). LlamaIndex does something similar. However, I would be extra careful while issuing parallel sub-queries due to the brittle nature of the system.

-- Cool project! I like the fact that the agent decision-making is clearly shown in the UI. A few questions: 1) How do you handle LLM output inconsistencies? 2) Can the user change the prompts for tasks or sub-tasks if the output is not satisfactory? Overall, a great idea and this sub-question query engine might simplify some of the abstractions here.
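
On the point about issuing independent sub-question queries non-serially: a minimal sketch of what that could look like with asyncio. This is not the repository's or LlamaIndex's code. `answer_sub_question` is a hypothetical stand-in for one retrieval + LLM call, and `max_concurrency` is an assumed knob. Failures are caught per sub-question so that one brittle call does not take down the whole batch.

```python
import asyncio


async def answer_sub_question(sub_question: str) -> str:
    """Hypothetical stand-in for one retrieval + LLM call."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"answer to: {sub_question}"


async def answer_all(sub_questions, max_concurrency=4):
    # Cap the number of in-flight calls to stay under rate limits.
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(question):
        async with semaphore:
            try:
                return await answer_sub_question(question)
            except Exception as exc:
                # Record the failure instead of aborting the other sub-questions.
                return f"FAILED: {question} ({exc})"

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(q) for q in sub_questions))


def run_sub_questions(sub_questions):
    return asyncio.run(answer_all(sub_questions))
```

Calling `run_sub_questions(["What is X?", "What is Y?"])` returns one answer per sub-question, in input order. This only helps when the sub-questions really are independent; if one depends on another's answer, it still has to run serially.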

1) What do you mean by LLM output inconsistencies? Most LLM responses are parsed, and then if that fails, an attempt to auto-fix them is made by re-running the previous output through a rewriting/schema prompt.

2) I want that feature too, and have it planned! I want to have a sort of knowledge / progress dashboard, where users can "chat their data". I also want to add to each sub-task the ability to restart from that point. Essentially, since the project is running on an entirely serverless architecture, this means serializing everything important, canceling current functions, and then re-hydrating from a certain point and calling the serverless functions again.
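
The parse-then-auto-fix approach described in 1) can be sketched roughly as follows. This is an illustration under assumptions, not waggledance.ai's actual code: the output is assumed to be a JSON object, and `repair` is a hypothetical callable that would normally send the bad output and the parser error back to an LLM with a rewriting/schema prompt. Here a trivial `strip_to_braces` stands in for that call so the example runs offline.

```python
import json


def parse_with_repair(raw_output, repair, max_repairs=1):
    """Parse LLM output as a JSON object, asking `repair` to fix it on failure."""
    text = raw_output
    for attempt in range(max_repairs + 1):
        try:
            parsed = json.loads(text)
            if not isinstance(parsed, dict):
                raise ValueError("expected a JSON object")
            return parsed
        except ValueError as exc:  # json.JSONDecodeError is a ValueError
            if attempt == max_repairs:
                raise ValueError(
                    f"could not parse LLM output after {max_repairs} repair(s)"
                ) from exc
            # Hand the bad output and the error message to the repair step.
            text = repair(text, str(exc))


def strip_to_braces(bad_output, error_message):
    """Hypothetical offline stand-in for an LLM rewriting prompt."""
    start, end = bad_output.find("{"), bad_output.rfind("}")
    return bad_output[start:end + 1]
```

For example, `parse_with_repair('Sure! {"tool": "retrieval"}', strip_to_braces)` fails the first parse, runs one repair, and then returns the dictionary. Capping `max_repairs` matters because each repair is another paid LLM call.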

1) While building this system, I found that the LLM can sometimes generate unpredictable responses. For example, the LLM sometimes chooses to summarize the document even for a simple retrieval question. When using expensive LLM models, this mistake could result in 10x higher cost. In your case, the LLM could generate sub-tasks that incur significant operating overheads. Just curious if you're currently facing such issues and if you have plans to mitigate them.

2) The restart idea is neat! I often faced this scenario where only a few sub-questions have some issues that need to be fixed. Tweaking them without re-running the whole pipeline seems like a useful feature in this case.

As a researcher, I've been interested in developing a RAG pipeline populated with research articles on my topic of study. Does the RAG approach make it easy to also return excerpts from the actual documents, to help me verify at a glance the source and veracity of LLM outputs?
Yes, this is an excellent RAG use-case! The vector index that I use in the repository uses EvaDB [1] to retrieve the top-K matches to the user queries from the available data sources. So, you can manually inspect the best matches to your query from the research article and verify the correctness of the LLM responses.

[1] https://github.com/georgia-tech-db/evadb
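
To make the "inspect the top-K matches" idea concrete, here is a minimal sketch that returns the matching excerpts together with their sources. It does not use EvaDB's API. The bag-of-words `embed` function is a toy stand-in for a real embedding model, and the `source` / `text` field names are assumptions made for the example.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words 'embedding'; a real pipeline would use a model."""
    return Counter(text.lower().split())


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def top_k_excerpts(query, chunks, k=2):
    """Return the k best-matching chunks with source and score for inspection."""
    query_vec = embed(query)
    scored = [(cosine(query_vec, embed(c["text"])), c) for c in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [
        {"source": c["source"], "excerpt": c["text"], "score": round(score, 3)}
        for score, c in scored[:k]
    ]
```

Showing these excerpts next to the generated answer lets a reader check each claim against the passage it supposedly came from.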

You can do the summarization part however you want. You don't even need to have an LLM summarize what the program found. The retrieved context includes the answer, so you can just include that in your final response.
It is currently not possible to get rigorous summaries of paper chunks using GPT-4.
I love this write-up. Thank you! I’m looking for more resources like this - clear examples of composing LLMs into useful systems. Some of the cookbook examples in langchain, chainlit, etc. have been useful too.
Thanks for the kind words! +1 for chainlit. I love their documentation. Do you have any specific use-cases in mind that would benefit from such pipelines?
I’m really interested in content explaining how to navigate graphs of embedded items for Q/A. Any resources on how to do this or arguments for why it’s a bad approach?

For example, if my top K docs aren’t answering the question but each is linked to neighbors, I’d want to know some folk wisdom or tricks for structuring the neighbor graph to cheaply expand the set of useful results.

You could in theory create a tool/function like “Context Retrieval”, give it to an Agent, and instruct the Agent to paginate through it as needed. This would add some errors due to LLM usage and latency though.

And then of course you would still need to design the graph structure. Maybe neo4j or similar graph dbs would be useful? I have seen a langchain integration for instance: https://python.langchain.com/docs/integrations/providers/neo...

This doesn't use neighbors, but Autogen has a multi-agent pipeline where, if a question is not answered by the top K docs, the agent can request "UPDATE CONTEXT" and pull the next set of results.

See example 5 here: https://github.com/microsoft/autogen/blob/main/notebook/agen...
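
The "pull the next set of results" pattern can be sketched in a few lines. This is not AutoGen's API. `PagedRetriever`, `answer_with_updates`, and the `try_answer` callable are hypothetical names, and `try_answer` stands in for an LLM call that either answers or replies "UPDATE CONTEXT".

```python
class PagedRetriever:
    """Serves a pre-ranked list of documents one page at a time."""

    def __init__(self, ranked_docs, page_size=2):
        self.ranked_docs = list(ranked_docs)
        self.page_size = page_size
        self.offset = 0

    def next_page(self):
        page = self.ranked_docs[self.offset:self.offset + self.page_size]
        self.offset += len(page)
        return page


def answer_with_updates(question, retriever, try_answer, max_pages=3):
    """Keep widening the context until the answerer stops asking for more."""
    context = []
    for _ in range(max_pages):
        page = retriever.next_page()
        if not page:
            break  # ran out of results
        context.extend(page)
        answer = try_answer(question, context)
        if answer != "UPDATE CONTEXT":
            return answer
    return None  # no answer within the page budget
```

The `max_pages` cap bounds the added latency and cost, since every extra page means another LLM call.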

Wouldn't more semantically related neighbors be retrieved by just increasing K?
Potentially, yes! The scenario I am imagining is that using A as context for a query yields results 1, 2, and 3. Sometimes, finding neighbors of 1 (i.e., not necessarily in the top K w.r.t. A) instead of going to results 4, 5, 6 might be better.
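
A rough sketch of that neighbor-expansion idea, under assumptions: the graph is a plain adjacency dict, `scores` holds whatever relevance signal is available for each candidate, and `budget` limits how many neighbors get added. All names are hypothetical, and how to build the neighbor graph is left open.

```python
def expand_with_neighbors(top_k, neighbors, scores, budget=3):
    """Append the best-scoring unseen neighbors of the top-K results."""
    seen = set(top_k)
    candidates = []
    for doc in top_k:
        for neighbor in neighbors.get(doc, []):
            if neighbor not in seen:
                seen.add(neighbor)
                candidates.append(neighbor)
    # Rank neighbors by the available relevance signal; unknown ones score 0.
    candidates.sort(key=lambda d: scores.get(d, 0.0), reverse=True)
    return list(top_k) + candidates[:budget]
```

With `top_k = ["1", "2", "3"]`, a neighbor of "1" can enter the context even if it would have ranked well below result 6 against the original query.
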
Pretty cool tutorial. As a side note, it is pretty hard to evaluate these pipelines for quality once you build them, since there aren't many standard practices yet given how new this all is. If it's helpful to anyone else, we built a free open source tool within my company that is basically a collection of premade metrics for determining the quality of these pipelines. https://github.com/TonicAI/tvalmetrics
This is really useful! Using LLM-assisted evaluation seems like the way to go for evaluating RAG applications. One issue I've faced while evaluating responses using GPT-4 is that the evaluation cost can go out of hand rather quickly. Do you have any measures in place or ideas on how to handle this?
Unfortunately, right now the LLM cost is just a fundamental issue. I think it is hard to get around because comparing answer quality usually involves understanding the question and answer itself which is a task that's really well suited to LLMs.

One thing we have considered is that some forms of evaluation could be replaced by simply comparing the embeddings of the question, context, and answer instead of using an LLM for analysis. The idea is you could compare all the embeddings to get a rough idea of the performance based on similarity. That should in theory reduce costs. The only other alternative is just to use less advanced models, which are cheaper.
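
As a rough illustration of the embedding-comparison idea (not how tvalmetrics works), the sketch below computes pairwise cosine similarities between already-computed embeddings. The vectors are assumed to come from whatever embedding model the pipeline uses, and the metric names are made up for the example.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_scores(question_vec, context_vec, answer_vec):
    """Cheap similarity proxies; no LLM call is needed at evaluation time."""
    return {
        # Does the answer stay on the topic of the question?
        "answer_vs_question": cosine(answer_vec, question_vec),
        # Is the answer close to the retrieved context it should be based on?
        "answer_vs_context": cosine(answer_vec, context_vec),
        # Did retrieval return context related to the question?
        "context_vs_question": cosine(context_vec, question_vec),
    }
```

These scores only measure similarity, so they can flag off-topic answers or poor retrieval but cannot tell whether an on-topic answer is actually correct.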

Great explanation!