We like federated search, we like decentralized search, and even P2P search; we are trying to find a good mix, and decided to get started rather than wait! Exciting times.
I'm not trying to be dismissive; it's just my feeling from working on search.marginalia.nu that nearly every aspect of search benefits from locality. Not only is the full crawl set instrumental in determining both domain rankings and term-level relevance signals such as anchor tag keywords, but the way an inverted index is typically set up is also extremely disk-cache friendly: the access pattern for checking the first document warms up the cache for the other queries. That discount obviously only exists when it's the same cache.
You could get people creating indexes with love, such as your own. Marginalia could become the de facto index for long-form content. However, you probably aren't that interested in running the best Pokémon website, so someone else could do that.
If enough people add domain-specific search endpoints, perhaps with a taxonomy to say "hey, send those sorts of queries over here", you have a compelling engine that self-heals should someone stop running things or start spamming.
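To make the routing idea concrete, here is a minimal sketch in Python. Everything in it is hypothetical: the `Endpoint` record, the topic labels, and the `healthy`/`blocked` flags are assumptions for illustration, not part of any existing engine. It only shows how a taxonomy lookup can skip nodes that have gone away or started spamming.

```python
from dataclasses import dataclass


@dataclass
class Endpoint:
    """A hypothetical domain-specific search endpoint."""
    name: str
    topics: set[str]       # taxonomy labels this endpoint claims to cover
    healthy: bool = True   # set to False when the node stops responding
    blocked: bool = False  # set to True when the node is caught spamming


def route(query_topics: set[str], endpoints: list[Endpoint]) -> list[str]:
    """Return the names of usable endpoints whose topics overlap the query's."""
    return [
        e.name
        for e in endpoints
        if e.healthy and not e.blocked and e.topics & query_topics
    ]


# Assumed example registry: two independent Pokémon indexes and one long-form one.
endpoints = [
    Endpoint("longform", {"essays", "blogs"}),
    Endpoint("pokemon-a", {"pokemon"}),
    Endpoint("pokemon-b", {"pokemon"}),
]
```

If `pokemon-a` stops running, marking it unhealthy leaves `pokemon-b` to answer the same queries, which is the "self-healing" part. How health and spam would actually be detected is left open here.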
Another reason: you can also integrate results from sources you cannot index yourself, such as social media APIs.
You could also mix and match search results from various topic-oriented indices. Whether that is really better than building one unified index is a research question. But we think it is the way to bring index fragments to the edge, with the obvious privacy advantages.
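One well-known way to mix ranked lists from separate indices, when their scores are not comparable, is reciprocal rank fusion (RRF). The sketch below is an illustration of that technique, not a description of how this project merges results; the function name `rrf_merge` and the example URLs are made up, and `k=60` is just the commonly used default.

```python
from collections import defaultdict


def rrf_merge(result_lists, k=60):
    """Merge ranked URL lists with reciprocal rank fusion.

    Each URL scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            scores[url] += 1.0 / (k + rank)
    return sorted(scores, key=lambda url: (-scores[url], url))


# Assumed example: one topic index and one long-form index answer the same query.
merged = rrf_merge([
    ["a.example", "b.example", "c.example"],
    ["b.example", "d.example"],
])
```

A document that appears in several indices rises to the top even if no single index ranked it first. Whether that beats a single unified ranking is exactly the open question.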
I would love to be able to run a node that mirrors part or all of an index like this, and to let people query it - a bit like https://torrents-csv.ml/#/
Good luck! I'll be watching your progress and cheering you all on!