by joshring 848 days ago

Is there a roadmap for planned features in the future? I wouldn't call this a "powerful tool for addressing key challenges in deploying RAG systems" right now. It seems to do the most simple version of RAG that the most basic RAG tutorial teaches someone how to do with a pretty UI over it.

The key challenges I've faced around RAG are things like:

- Only works on text based modalities (how can I use this with all types of source documents, including images)

- Chunking "well" for the type of document (by paragraph, CSVs including the header on every chunk, tables in PDFs, diagrams, etc). The rudimentary chunk-by-character-with-overlap approach is demonstrably not very good at retrieval

- The R in RAG is really just "how can you do the best possible search for the given query". The approach here is so simple that it definitely does not give the best possible search results. It's missing so many known techniques right now, like:

    - Generate example queries that the chunk can answer and embed those to search against.

    - Parent document retrieval

    - so many newer, better RAG techniques have been talked about and used that are better than chunk based

    - How do you differentiate "needs all source" vs "find in source" questions? Think: Summarize the entire pdf, vs a specific question like how long does it take for light to travel to the moon and back?

- Also other search approaches like fuzzy search/lexical based approaches. And ranking them based on criteria like (user query is one word, use fuzzy search instead of semantic search). Things like that
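To make the "CSVs including the header on every chunk" item above concrete, here is a minimal sketch using only the Python standard library. The function name `chunk_csv_with_header` and the `rows_per_chunk` parameter are made up for illustration; nothing here is R2R's API.

```python
import csv
import io


def chunk_csv_with_header(text: str, rows_per_chunk: int = 2) -> list[str]:
    """Split CSV text into chunks of at most rows_per_chunk data rows,
    repeating the header row at the top of every chunk."""
    rows = list(csv.reader(io.StringIO(text)))
    if not rows:
        return []
    header, data = rows[0], rows[1:]
    chunks = []
    for start in range(0, len(data), rows_per_chunk):
        buf = io.StringIO()
        writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(header)  # every chunk carries its own column names
        writer.writerows(data[start:start + rows_per_chunk])
        chunks.append(buf.getvalue())
    return chunks


sample = "name,price\napple,1\npear,2\nplum,3\n"
chunks = chunk_csv_with_header(sample, rows_per_chunk=2)
for chunk in chunks:
    print(repr(chunk))
```

Each chunk can then be embedded on its own without losing the column names, which a plain character-based splitter would drop from every chunk after the first.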

So far this platform seems to just lock you into a really simple embedding pipeline that only supports the most simple chunk based retrieval. I wouldn't use this unless there was some promise of it actually solving some challenges in RAG.

3 comments

ocolegro 848 days ago

Thanks for taking the time to provide your candid feedback, I think you have made a lot of good points.

You are correct that the options in R2R are fairly simple today. Our approach here is to get input from the developer community to make sure we are on the right track before building out more novel features.

Regarding your challenges:

- Only works on text based modalities (how can I use this with all types of source documents, including images)

  For the immediate future R2R will likely remain focused on text, but you are right that the problem gets even more challenging when you introduce the idea of images. I'd like to start working on multi-modal soon.

  On chunking, this is very true - a short/medium term goal of mine is to integrate some more intelligent chunking approaches, ranging from Vikp's Surya to Reducto's proprietary model. I'm also interested in exploring what can be done from the pure software side.

    - Generate example queries that the chunk can answer and embed those to search against.

    - Parent document retrieval

    - so many newer, better RAG techniques have been talked about and used that are better than chunk based

    - How do you differentiate "needs all source" vs "find in source" questions? Think: Summarize the entire pdf, vs a specific question like how long does it take for light to travel to the moon and back?

You mentioned "Generate example queries", there is already an example that shows how to generate and search over synthetic queries w/ minor tweaks to the basic pipeline [https://github.com/SciPhi-AI/R2R/blob/main/examples/academy/...].
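For readers who have not seen the synthetic-query idea, a toy sketch follows. It is not the linked R2R example: the bag-of-words `embed` function stands in for a real embedding model, and the hand-written questions stand in for LLM-generated ones. All names and data are hypothetical.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical data: in practice an LLM would generate the questions per chunk.
chunks = {
    "c1": "Light needs about 1.28 seconds to cover the Earth-Moon distance.",
    "c2": "Postgres with pgvector stores embeddings next to relational data.",
}
synthetic_queries = {
    "c1": ["how long does light take to reach the moon"],
    "c2": ["where can I store embeddings in postgres"],
}

# Index the questions rather than the chunk text; each entry points back to its chunk.
index = [(embed(q), cid) for cid, qs in synthetic_queries.items() for q in qs]


def search(query: str) -> str:
    """Return the text of the chunk whose synthetic question best matches the query."""
    qv = embed(query)
    _, best_id = max(index, key=lambda entry: cosine(qv, entry[0]))
    return chunks[best_id]


print(search("how long for light to travel to the moon"))
```

The point of the design is that user queries are compared against question-shaped text, which tends to sit closer to them than the raw chunk does.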

I think the other approaches you outline are all worth investigating as well. There is definitely a tension we face between building and testing new experimental approaches vs. figuring out what features people need in production and implementing those.

Just so you know where we are heading - we want to make sure all the features are there for easy experimentation, but we also want to provide value in production and beyond. As an example, we are currently working on robust task orchestration to accompany our pipeline abstractions to help with ingesting large quantities of data, as this has been a pain point in our own experience and that of some of our early enterprise users.

joshring 848 days ago

Nice, thanks for the reply. Glad to hear you are looking into these challenges and plan to tackle some of them. Will keep my eye on the repo for some of these improvements in the future.

And totally agree, the scaling out of ingesting large quantities of data is a hard challenge as well and it does make sense to work on that problem space too. Sounds like that is a higher priority at the moment which is totally fine.

ocolegro 848 days ago

No worries, thanks again for the thoughtful feedback.

We are also very interested in the more novel RAG techniques, so I'm not sure that one is necessarily a higher priority than the other.

We've just gotten more immediate feedback from our early users around the difficulties of ingesting data in production and there is less ambiguity around what to build.

Out of your previous list, is there one example that you think would be most useful for the next addition to the framework?

joshring 848 days ago

Well, as someone building something similar I have been looking around at how people are tackling the problem of varied index approaches for different files, and again how that can scale.

I haven't read the code on your GitHub, but the README mentions using qdrant/pgvector. I'm curious how you will get that to scale to billions of files with tens or hundreds (or more) of different indexing approaches for each file. It doesn't feel tenable to keep it all in a single Postgres instance, as it will just grow and grow forever.

Think even a very simple example of more indexes per file: having chunk sizes of 20/500/1000 along with various overlaps of 50/100/500. You suddenly have a large combination of indexes you need to maintain and each is basically a full copy of the source file. (You can imagine indexes for BM25, fuzzy matching, lucene, etc...)
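A quick back-of-the-envelope sketch of that example, taking the numbers above literally. The `storage_factor` helper is an assumption made for illustration: it estimates stored text only (each chunk advances by size minus overlap) and ignores vectors and index overhead.

```python
from itertools import product

chunk_sizes = [20, 500, 1000]
overlaps = [50, 100, 500]

# An overlap only makes sense when it is smaller than the chunk size.
valid = [(size, ov) for size, ov in product(chunk_sizes, overlaps) if ov < size]


def storage_factor(size: int, overlap: int) -> float:
    """Approximate stored text relative to the source file:
    each chunk holds `size` characters but advances by `size - overlap`."""
    return size / (size - overlap)


total_factor = sum(storage_factor(s, o) for s, o in valid)

for size, ov in valid:
    print(f"size={size:>4} overlap={ov:>3} -> ~{storage_factor(size, ov):.2f}x source text")
print(f"{len(valid)} valid configs, ~{total_factor:.2f}x source text in total")
```

With these particular numbers only 5 of the 9 combinations are usable, and they already add up to roughly 6.5 copies of the source text before any BM25, fuzzy, or other index is counted.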

You could go the brute-force-ish route and always run every single index mode for every file until a better process exists to pick only the best ones for a specific file. But even if you narrowed it down, a file could want 5 different index types searched and ranked for the retrieval step.

I want to know how people plan to shard/make it possible to have so many search indexes on all their data and still be able to query against all of it. Postgres will eventually run out of space even on the beefiest cloud instance fairly quickly.

The second biggest thing is then to tackle how to use all of those indexes well in the Retrieval step. Which indexes should be searched against/weighted and how given the user query/convo history?

chiccomagnus 847 days ago

You are both right about chunking, and I think it is one of the main challenges. On more intelligent chunking approaches, I think you should give preprocess.co a try. It's able to preprocess and chunk PDFs, Office files, and HTML content. It follows the original document layout and considers the content semantics, so you get optimal chunks.

viraptor 848 days ago

Do you know of any open source project which does support the extra functionality around the different approaches to embedding / queries?

nl 848 days ago

LlamaIndex (and I think LangChain) does these things.

SeanAppleby 848 days ago

This is my problem with every end to end system I've seen around this. I find that, even building these systems from scratch, all of the hard parts are just normal data infrastructure problems. The "AI" part takes a small fraction of the effort to deliver even when just building the RAG part directly on top of huggingface/transformers.

I also have dealt with what you're describing, but then it goes much farther when going to prod IME. The ingestion part is even more messy in ways these kinds of platforms don't seem to help with. When managing multiple tools in prod with overlapping and non-constant data sources (say, you have two tools that need to both know the price of a product, which can change at any time), I need both of those to be built on the same source of truth and for that source of truth to be fed by our data infra in real time, where relevant documents need to be replaced in real time in more or less an atomic way.
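A very small in-process sketch of the "one source of truth, replaced more or less atomically" requirement. `DocumentStore` is a hypothetical class; a production system would more likely lean on a transaction in the backing database, so read this as an illustration of the snapshot-swap idea only.

```python
import threading


class DocumentStore:
    """Single source of truth shared by several tools.
    Readers see either the old snapshot or the new one, never a mix."""

    def __init__(self):
        self._lock = threading.Lock()
        self._snapshot = {}  # doc_id -> text; never mutated after publishing

    def replace(self, updates: dict) -> None:
        # Build the next snapshot off to the side, then publish it in one assignment.
        with self._lock:
            new_snapshot = {**self._snapshot, **updates}
            self._snapshot = new_snapshot

    def read(self) -> dict:
        # Callers hold a reference to one consistent snapshot for as long as they need it.
        return self._snapshot


store = DocumentStore()
store.replace({"widget": "price: 10"})
before = store.read()          # a tool starts answering with this snapshot
store.replace({"widget": "price: 12"})
after = store.read()
print(before["widget"], "->", after["widget"])
```

Both tools reading from the same store get the same price at any given moment, and a tool that is mid-request keeps a consistent view until it asks again.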

Then, I have some tools that have varying levels of permissioning on those overlapping data sources, say, you have two tools that exist in a classroom, one that helps the student based on their work, and another that is used by the TA or teacher to help understand students' answers in a large course. They have overlapping data needs on otherwise private data, and this kind of permissioning layer which is pretty trivial in a normal webapp has, IME, had to have been implemented basically from scratch on top of the vector db and retrieval system.
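A sketch of the kind of permission layer described above, assuming each chunk carries the set of roles allowed to see it. The `Chunk` fields, role strings, and scores are invented for the example; the one design point is that filtering happens before ranking, so hidden text never reaches the prompt.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    allowed_roles: frozenset
    score: float  # similarity score from a retriever, assumed precomputed


def retrieve_for(role: str, candidates: list, top_k: int = 2) -> list:
    """Drop chunks the caller may not see, then return the top_k by score."""
    visible = [c for c in candidates if role in c.allowed_roles]
    return sorted(visible, key=lambda c: c.score, reverse=True)[:top_k]


candidates = [
    Chunk("Alice's draft answer", frozenset({"student:alice", "teacher"}), 0.9),
    Chunk("Bob's draft answer", frozenset({"student:bob", "teacher"}), 0.8),
    Chunk("Course syllabus", frozenset({"student:alice", "student:bob", "teacher"}), 0.5),
]

student_view = [c.text for c in retrieve_for("student:alice", candidates)]
teacher_view = [c.text for c in retrieve_for("teacher", candidates)]
print(student_view)
print(teacher_view)
```

In a real deployment the same filter would ideally be pushed down into the vector store's query so that restricted chunks are never fetched at all.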

Then experimentation, eval, testing, and releases are the hardest and most underserved. Only relatively recently did anyone even seem to be talking about eval as a problem to aspire to solve. There's a pretty interesting and novel interplay of the problems of production ML eval, but with potentially sparse data, and conventional unit testing. This is the area we had to put the most of our own thought into for me to feel reasonably confident in putting anything into prod.
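One way to picture the overlap between ML eval and unit testing is a recall@k check over a small hand-labelled set that runs like any other test. Everything below is hypothetical: the golden set, the canned `fake_retriever`, and any threshold a team would pick for it.

```python
def recall_at_k(retrieved: list, relevant: set, k: int) -> float:
    """Fraction of the relevant ids that appear in the top k retrieved ids."""
    if not relevant:
        raise ValueError("relevant must not be empty")
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(retriever, golden: dict, k: int) -> float:
    return sum(recall_at_k(retriever(q), rel, k) for q, rel in golden.items()) / len(golden)


# Hypothetical golden set: query -> ids a human marked as relevant.
golden = {
    "price of widget": {"doc_price"},
    "return policy": {"doc_returns", "doc_faq"},
}

# Stand-in for the real retrieval pipeline, with canned rankings.
canned = {
    "price of widget": ["doc_price", "doc_faq"],
    "return policy": ["doc_returns", "doc_shipping", "doc_faq"],
}


def fake_retriever(query: str) -> list:
    return canned[query]


score = mean_recall(fake_retriever, golden, k=2)
print(f"mean recall@2 = {score:.2f}")
```

A release gate could then assert that the score does not drop below the last released value, which keeps the check meaningful even when the golden set is small.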

FWIW we just built our own internal platform on top of langchain a while back, seemed like a good balance of the right level of abstraction for our use cases, solid productivity gains from shared effort.

I think this is a really interesting problem space, but yeah, I'm skeptical of all of these platforms as they seem to always be promising a lot more than they're delivering. It looks superficially like there has been all of this progress on tooling, but I built a production service based on vector search in 2018 and it really isn't that much easier today. It works better because the models are so much better, but the tools and frameworks don't help that much with the hard parts, to my surprise honestly.

Perhaps I'm just not the user and am being excessively critical, but I keep having to deal with execs and product people throwing these frameworks at us internally without understanding the alignment between what is hard about building these kinds of services in prod and what these kinds of tools make easier vs harder.

ocolegro 847 days ago

This is AMAZING feedback and it is on brand with what I've heard from a number of builders. Thanks for sharing your experiences here.

The infra challenges are real - they are what I have struggled with the most in providing high quality support for early users. Most want to be able to reliably firehose 10-100s of GBs of data through a brittle multistep pipeline. This was something I struggled with when building AgentSearch [https://huggingface.co/datasets/SciPhi/AgentSearch-V1] with LOCAL data - so introducing the networking component only makes things that much harder.

I think we have a lot of work to do to robustly solve this problem, but I'm confident that there is an opportunity to build a framework that results in net positives for the developer.

FWIW, your feedback would be invaluable as the project continues to grow.
