by davedx 724 days ago
I’ve checked out quite a few RAG projects now, and what I haven’t really seen solved is ingestion. It’s usually like “this is an endpoint or some connectors, have fun!”.

How do I do a bulk/batch ingest of, say, 10k HTML documents into this system?

4 comments

All the pipelines are async, so for ingestion we have typically seen that R2R can saturate the vector db or embedding provider. We don't yet have backpressure so it is up to the client to rate limit.
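
Since rate limiting is left to the client, here is a minimal sketch of one way to cap concurrency with `asyncio.Semaphore`. It is not R2R code: `ingest_one` is a hypothetical stand-in for whatever ingestion call you actually make, and `max_in_flight=8` is an arbitrary value you would tune against your vector DB or embedding provider.

```python
import asyncio
from typing import Awaitable, Callable, Iterable, List


async def ingest_one(doc_id: str) -> str:
    # Hypothetical placeholder: replace with your real ingestion call
    # (for example an HTTP request to your ingestion service).
    await asyncio.sleep(0.01)
    return doc_id


async def ingest_all(
    doc_ids: Iterable[str],
    max_in_flight: int = 8,
    ingest: Callable[[str], Awaitable[str]] = ingest_one,
) -> List[str]:
    # The semaphore caps how many ingest calls run at the same time.
    sem = asyncio.Semaphore(max_in_flight)

    async def guarded(doc_id: str) -> str:
        async with sem:
            return await ingest(doc_id)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(d) for d in doc_ids))


if __name__ == "__main__":
    results = asyncio.run(ingest_all([f"doc-{i}" for i in range(100)]))
    print(len(results))
```

This only bounds the number of concurrent calls; if your provider enforces requests per second, you would likely need a time-based limiter on top.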

Ingestion is pretty straightforward: you can call R2R directly or use the client-server interface to pass the HTML files directly to the ingest_files endpoint (https://r2r-docs.sciphi.ai/api-reference/endpoint/ingest_fil...).
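
For something like 10k files, the client side might look roughly like the sketch below: collect the HTML paths and split them into fixed-size batches, then hand each batch to whatever upload call you use. The helper names `html_files` and `batched` and the batch size are illustrative assumptions, and the actual request to the endpoint is deliberately left out because its parameters are not shown here.

```python
from itertools import islice
from pathlib import Path
from typing import Iterable, Iterator, List


def html_files(root: str) -> List[Path]:
    # Recursively collect .html files under root, sorted for a stable order.
    return sorted(Path(root).rglob("*.html"))


def batched(items: Iterable[Path], size: int) -> Iterator[List[Path]]:
    # Yield consecutive lists of at most `size` items until the input runs out.
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    for i, batch in enumerate(batched(html_files("."), 100)):
        # Replace this print with your upload call for the batch.
        print(f"batch {i}: {len(batch)} files")
```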

The data parsers are all fairly simple and easy to customize. Right now we use bs4 for handling HTML but have been considering other approaches.
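
To give a sense of how small a custom HTML parser can be, here is a sketch that uses only the standard library's `html.parser` rather than bs4. It is not the parser R2R ships; the `TextExtractor` class and `html_to_text` function are made up for illustration, and the choice to skip `script` and `style` content is an assumption about what you want indexed.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    # Tags whose text content should not end up in the extracted text.
    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self._chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is outside skipped tags.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any buffered trailing text
    return parser.text()


if __name__ == "__main__":
    print(html_to_text("<h1>Title</h1><p>Hello <b>world</b></p>"))
```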

What specific features around ingestion have you found lacking?

Thanks, I’ll give it a try!
I'd like to know this too. A quick service guide along the lines of: "take these docs as input, ingest and save, now sit there providing an API to get results".
Take a look here - https://r2r-docs.sciphi.ai/quickstart#ingest-data and here https://r2r-docs.sciphi.ai/cookbooks/client-server#ingest-do...

Since multiple people have requested this, we are pushing a quick change to emphasize it in the docs.

Thank you. My own comment giving a quickstart scenario was downvoted :( https://news.ycombinator.com/item?id=40801453 but I saw you kindly replied to it! Thank you, I appreciate that.
LlamaIndex can ingest directories if you want to do bulk.
What do you want to do with the data after ingesting?