
by joshring 849 days ago

Well, as someone building something similar, I have been looking at how people tackle the problem of varied indexing approaches for different files, and how that scales.

I haven't read the code on your GitHub, but the README mentions using qdrant/pgvector. I'm curious how you will scale that to billions of files, each with tens or hundreds of different indexing approaches. It doesn't feel tenable to keep it all in a single Postgres instance, as it will just grow and grow forever.

Think of even a very simple example of multiple indexes per file: chunk sizes of 20/500/1000 combined with overlaps of 50/100/500. You suddenly have a large combination of indexes to maintain, and each is basically a full copy of the source file. (You can imagine further indexes for BM25, fuzzy matching, Lucene, etc.)
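To put rough numbers on that, here is a minimal Python sketch that enumerates the chunk-size/overlap combinations and counts how many characters end up stored for one file. The helper names (`valid_configs`, `chunk`, `stored_chars`) and the character-based chunking are my own assumptions for illustration, not anything from the project being discussed. It also skips combinations where the overlap is not smaller than the chunk, since those cannot advance through the file.

```python
from itertools import product

CHUNK_SIZES = (20, 500, 1000)
OVERLAPS = (50, 100, 500)

def valid_configs(sizes=CHUNK_SIZES, overlaps=OVERLAPS):
    """Return (chunk_size, overlap) pairs where the overlap is smaller than the chunk."""
    return [(s, o) for s, o in product(sizes, overlaps) if o < s]

def chunk(text, size, overlap):
    """Split text into windows of `size` characters; each window shares
    `overlap` characters with the previous one."""
    step = size - overlap
    # Stop early enough that the last window is not made of overlap only.
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]

def stored_chars(text, configs):
    """Total characters stored across all chunk indexes for a single file."""
    return sum(len(c) for s, o in configs for c in chunk(text, s, o))

if __name__ == "__main__":
    configs = valid_configs()
    text = "x" * 10_000  # stand-in for one source file
    print(len(configs), "usable configs")
    print(stored_chars(text, configs) / len(text), "x the source size")
```

Only 5 of the 9 combinations are usable here, and each of them still stores at least one full copy of the file, before counting embeddings or any non-chunk indexes.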

You could be brute-force-ish and always run every index mode for every file until a better process exists to pick only the best ones for a specific file. But even if you narrowed it down, a single file could still need 5 different index types searched and ranked in the Retrieval step.

I want to know how people plan to shard their data so that they can keep this many search indexes over all of it and still query against all of it. Even the beefiest cloud Postgres instance will run out of space fairly quickly.
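For concreteness, one simple way to spread the indexes out is to hash each (file, index type) pair onto a shard. This is only a sketch of the general idea under my own assumptions: `shard_for` and `group_by_shard` are hypothetical helpers, and plain modulo hashing moves most keys whenever the shard count changes, so a real system would more likely use consistent hashing or a lookup table.

```python
import hashlib
from collections import defaultdict

def shard_for(file_id: str, index_kind: str, num_shards: int) -> int:
    """Map one (file, index type) pair to a shard number, deterministically."""
    key = f"{file_id}:{index_kind}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the hash as an integer, then wrap to the shard count.
    return int.from_bytes(digest[:8], "big") % num_shards

def group_by_shard(file_ids, index_kinds, num_shards):
    """Group every (file, index type) pair by the shard that would hold it."""
    placement = defaultdict(list)
    for file_id in file_ids:
        for kind in index_kinds:
            placement[shard_for(file_id, kind, num_shards)].append((file_id, kind))
    return dict(placement)

if __name__ == "__main__":
    kinds = ["chunk500_overlap50", "bm25", "fuzzy"]  # illustrative index names
    print(group_by_shard(["a.pdf", "b.txt"], kinds, num_shards=4))
```

Writes stay cheap this way, but a query over all the data still has to fan out to every shard and merge the results, which is the part I am most unsure about.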

The second biggest thing is how to use all of those indexes well in the Retrieval step: which indexes should be searched, and how should they be weighted, given the user query and conversation history?
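One common baseline for combining results from several indexes is reciprocal rank fusion (RRF), here with a per-index weight added. This is not what the project does as far as I know; it is a sketch of one possible approach. The function name `weighted_rrf`, the index names, and the weights are assumptions, and choosing the weights from the query or conversation history is exactly the open question.

```python
from collections import defaultdict

def weighted_rrf(rankings, weights=None, k=60):
    """Fuse ranked result lists from several indexes.

    rankings: dict of index name -> list of document ids, best first.
    weights:  optional dict of index name -> weight (defaults to 1.0).
    k:        smoothing constant; 60 is a commonly used default.
    """
    scores = defaultdict(float)
    for name, ranked_ids in rankings.items():
        w = 1.0 if weights is None else weights.get(name, 1.0)
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += w / (k + rank)
    # Highest fused score first; ties are broken by id so output is stable.
    return sorted(scores, key=lambda d: (-scores[d], d))

if __name__ == "__main__":
    rankings = {
        "bm25": ["a", "b", "c"],
        "vector": ["b", "a", "d"],
    }
    print(weighted_rrf(rankings))
    print(weighted_rrf(rankings, weights={"vector": 2.0}))
```

RRF only needs ranks, not comparable scores, so it works across BM25, fuzzy, and vector indexes without normalizing their scores first.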