Hacker News
by ocolegro 848 days ago
Thanks for taking the time to provide your candid feedback. I think you have made a lot of good points.

You are correct that the options in R2R are fairly simple today. Our approach here is to get input from the developer community to make sure we are on the right track before building out more novel features.

Regarding your challenges:

- Only works on text based modalities (how can I use this with all types of source documents, including images)

  For the immediate future R2R will likely remain focused on text, but you are right that the problem gets even more challenging when you introduce the idea of images. I'd like to start working on multi-modal soon.
- Chunking "well" for the type of document (by paragraph, CSVs including the header on every chunk, tables in PDFs, diagrams, etc). The rudimentary chunk by character with overlap is demonstrably not very good at retrieval.

  This is very true - a short/medium term goal of mine is to integrate some more intelligent chunking approaches, ranging from Vikp's Surya to Reducto's proprietary model. I'm also interested in exploring what can be done from the pure software side.
- The R in RAG is really just "how can you do the best possible search for the given query". The approach here is so simple that it definitely does not give the best possible search results. It's missing so many known techniques right now, like:

    - Generate example queries that the chunk can answer and embed those to search against.

    - Parent document retrieval

    - So many newer RAG techniques have been talked about and used that are better than chunk-based retrieval

    - How do you differentiate "needs all source" vs "find in source" questions? Think: Summarize the entire pdf, vs a specific question like how long does it take for light to travel to the moon and back?

You mentioned "Generate example queries": there is already an example that shows how to generate and search over synthetic queries with minor tweaks to the basic pipeline [https://github.com/SciPhi-AI/R2R/blob/main/examples/academy/...].

I think the other approaches you outline are all worth investigating as well. There is definitely a tension we face between building and testing new experimental approaches vs. figuring out what features people need in production and implementing those.
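
For readers unfamiliar with the synthetic-query idea, here is a minimal sketch of it. This is not the R2R example linked above, just a toy illustration: the `bow` function is a stand-in for a real embedding model, and the chunk texts, chunk ids, and hard-coded queries are all hypothetical (in practice an LLM would generate the queries for each chunk).

```python
import math
import re
from collections import Counter


def bow(text):
    """Toy bag-of-words vector. Assumption: stands in for a real embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical chunks keyed by a made-up chunk id.
chunks = {
    "c1": "Light takes roughly 2.5 seconds to travel to the Moon and back.",
    "c2": "Character-based chunking with overlap splits text every N characters.",
}

# Hypothetical synthetic queries; an LLM would normally produce these per chunk.
synthetic_queries = {
    "c1": ["how long does it take for light to travel to the moon and back"],
    "c2": ["what is chunking with overlap"],
}

# The index stores the embedded *queries*, each pointing back at its chunk.
index = [(bow(q), cid) for cid, qs in synthetic_queries.items() for q in qs]


def search(query, k=1):
    """Return the ids of the k chunks whose synthetic queries best match `query`."""
    qv = bow(query)
    best = {}
    for vec, cid in index:
        best[cid] = max(best.get(cid, 0.0), cosine(qv, vec))
    ranked = sorted(best.items(), key=lambda kv: kv[1], reverse=True)
    return [cid for cid, _ in ranked[:k]]


top = search("How long for light to get to the moon and back?")
print(top, chunks[top[0]])
```

The user query is matched against questions rather than against the chunk text itself, which is the point of the technique: question-to-question similarity tends to be a closer match than question-to-passage.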

Just so you know where we are heading - we want to make sure all the features are there for easy experimentation, but we also want to provide value in production and beyond. As an example, we are currently working on robust task orchestration to accompany our pipeline abstractions to help with ingesting large quantities of data, as this has been a pain point in our own experience and that of some of our early enterprise users.


Nice, thanks for the reply. Glad to hear you are looking into these challenges and plan to tackle some of them. Will keep my eye on the repo for some of these improvements in the future.

And totally agree, the scaling out of ingesting large quantities of data is a hard challenge as well and it does make sense to work on that problem space too. Sounds like that is a higher priority at the moment which is totally fine.

No worries, thanks again for the thoughtful feedback.

We are also very interested in the more novel RAG techniques, so I'm not sure that one is necessarily a higher priority than the other.

We've just gotten more immediate feedback from our early users around the difficulties of ingesting data in production and there is less ambiguity around what to build.

Out of your previous list, is there one example that you think would be most useful for the next addition to the framework?

Well, as someone building something similar I have been looking around at how people are tackling the problem of varied index approaches for different files, and again how that can scale.

I haven't read the code on your GitHub but the readme mentions using qdrant/pgvector. I'm curious how you will tackle having that scale to billions of files with tens or hundreds (or more) of different indexing approaches for each file. It doesn't feel tenable to keep it in a single postgres instance as it will just grow and grow forever.

Think even a very simple example of more indexes per file: having chunk sizes of 20/500/1000 along with various overlaps of 50/100/500. You suddenly have a large combination of indexes you need to maintain and each is basically a full copy of the source file. (You can imagine indexes for BM25, fuzzy matching, lucene, etc...)
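
To make the combinatorics concrete, here is a rough sketch using the sizes and overlaps quoted above. It assumes an overlap only makes sense when it is smaller than the chunk size, so the 20-character chunks drop out and 5 of the 9 combinations remain. The helper names and the 10,000-character dummy document are made up for illustration.

```python
from itertools import product


def chunk_by_chars(text, size, overlap):
    """Naive character chunking: each chunk starts (size - overlap) after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]


sizes = [20, 500, 1000]
overlaps = [50, 100, 500]

# Assumption: keep only combinations where the overlap is smaller than the chunk.
configs = [(s, o) for s, o in product(sizes, overlaps) if o < s]


def storage_amplification(text, configs):
    """Total characters stored across all configs, relative to the source length."""
    stored = sum(len(c) for s, o in configs for c in chunk_by_chars(text, s, o))
    return stored / len(text)


doc = "x" * 10_000  # dummy document
print(len(configs), "usable configs:", configs)
print("stored text is", round(storage_amplification(doc, configs), 2), "x the source")
```

Every config stores at least one full copy of the text (more once overlap is counted), and that is before adding vectors, BM25 postings, or any other per-index structure.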

You could be brute-force-ish and always run every single index mode for every file until a better process exists to only do the best ones for a specific file. But even if you narrowed it down, a file could want 5 different index types searched and ranked for the Retrieval step.

I want to know how people plan to shard/make it possible to have so many search indexes on all their data and still be able to query against all of it. Postgres will eventually run out of space even on the beefiest cloud instance fairly quickly.

The second biggest thing is then to tackle how to use all of those indexes well in the Retrieval step. Which indexes should be searched, and how should each be weighted, given the user query and conversation history?
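
One common baseline for combining results from several indexes is reciprocal rank fusion (RRF), optionally with a per-index weight. The sketch below is only that baseline, not an answer to how the weights should be chosen per query. The index names, document ids, and weights are hypothetical, and `k=60` is just the constant commonly used with RRF.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Fuse ranked id lists from several indexes into one ranking.

    rankings: {index_name: [doc_id, ...]} with the best hit first.
    weights:  optional {index_name: float}; indexes not listed default to 1.0.
    """
    weights = weights or {}
    scores = defaultdict(float)
    for index_name, ranked_ids in rankings.items():
        w = weights.get(index_name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from three indexes over the same files.
rankings = {
    "bm25": ["d3", "d1", "d2"],
    "vector_500": ["d1", "d2", "d3"],
    "vector_1000": ["d1", "d3", "d4"],
}

print(reciprocal_rank_fusion(rankings))
print(reciprocal_rank_fusion(rankings, weights={"bm25": 5.0}))
```

RRF only needs ranks, not comparable scores, so it works across BM25, fuzzy matching, and vector indexes alike. Choosing the weights per query or per file type is the open part of the question.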

You are both right about chunking, and I think it is one of the main challenges. About more intelligent chunking approaches, I think you have to give preprocess.co a try. It's able to preprocess and chunk PDFs, Office files, and HTML content. It follows the original document layout and considers the content semantics, so you get optimal chunks.