Hacker News (HN Mirror)


	by davedx 724 days ago
	I’ve checked out quite a few RAG projects now, and what I haven’t seen really solved is ingestion. It’s usually like “this is an endpoint or some connectors, have fun!” How do I do a bulk/batch ingest of, say, 10k HTML documents into this system?

4 comments

ocolegro 724 days ago

All the pipelines are async, so for ingestion we have typically seen that R2R can saturate the vector DB or the embedding provider. We don't yet have backpressure, so it is up to the client to rate limit.
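For anyone wondering what client-side limiting could look like, here is a minimal sketch that caps the number of in-flight requests with a semaphore. It is not R2R code: `ingest_one`, `ingest_all`, `run_bulk_ingest`, and `max_in_flight` are hypothetical names, and the sleep stands in for whatever real ingestion call you make.

```python
import asyncio


async def ingest_one(doc: str) -> str:
    """Hypothetical stand-in for a real ingestion call (e.g. an HTTP request)."""
    await asyncio.sleep(0.01)  # simulate network latency
    return f"ingested:{doc}"


async def ingest_all(docs, max_in_flight=8, ingest=ingest_one):
    """Ingest every document, keeping at most max_in_flight calls open at once."""
    sem = asyncio.Semaphore(max_in_flight)

    async def bounded(doc):
        # Each task waits here until one of the max_in_flight slots is free.
        async with sem:
            return await ingest(doc)

    # gather() returns results in the same order as the input documents.
    return await asyncio.gather(*(bounded(d) for d in docs))


def run_bulk_ingest(docs, max_in_flight=8, ingest=ingest_one):
    """Synchronous entry point for scripts."""
    return asyncio.run(ingest_all(docs, max_in_flight, ingest))
```

Something like `run_bulk_ingest(paths, max_in_flight=4)` would then push through a large batch without opening thousands of requests at once. Strictly speaking this bounds concurrency rather than requests per second; a token bucket would be the next step if the provider enforces a hard rate.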

Ingestion is pretty straightforward: you can call R2R directly, or use the client-server interface to pass the HTML files directly to the ingest_files endpoint (https://r2r-docs.sciphi.ai/api-reference/endpoint/ingest_fil...).

The data parsers are all fairly simple and easy to customize. Right now we use bs4 for handling HTML but have been considering other approaches.
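R2R's actual parser isn't shown in this thread. As a rough idea of what a simple HTML-to-text parser does, here is a sketch that uses Python's standard-library `html.parser` instead of bs4, so it runs without extra dependencies. `TextExtractor` and `html_to_text` are hypothetical names, not part of R2R.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and ignores the contents of <script> and <style>."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self) -> str:
        return " ".join(self._chunks)


def html_to_text(html: str) -> str:
    """Return the visible text of an HTML document as a single string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A customized parser would mostly differ in which tags it skips and how it joins or chunks the extracted text.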

What specific features around ingestion have you found lacking?


davedx 724 days ago

Thanks, I’ll give it a try!


vintagedave 724 days ago

I'd like to know this too. A quick service guide along the lines of: "take these docs as input, ingest and save them, then sit there providing an API to get results."


ocolegro 724 days ago

Take a look here - https://r2r-docs.sciphi.ai/quickstart#ingest-data and here https://r2r-docs.sciphi.ai/cookbooks/client-server#ingest-do...

Since multiple people have requested this, we are pushing a quick change to emphasize it in the docs.


vintagedave 724 days ago

Thank you. My own comment giving a quickstart scenario was downvoted :( https://news.ycombinator.com/item?id=40801453 but I saw you kindly replied to it! Thank you, I appreciate that.


shepardrtc 724 days ago

LlamaIndex can ingest directories if you want to do bulk.


namanyayg 724 days ago

What do you want to do with the data after ingesting?
