Hacker News new | ask | show | jobs
by MillstoneX 4236 days ago
Can you give an example for a join between two different sharded document collections that can be executed efficiently on say 100 servers?
3 comments

Say you have one collection for your people (sharded over 100 servers, say) and another one for conferences (also sharded over 100 machines). Then you could hold the primary keys of all conferences a person attended in a JSON list stored in an attribute with the user. A query finding all people with last name "Jones" that have attended a given conference can now be executed efficiently by using a secondary index on the last name of people and performing a key lookup in the conferences collection. The latter only has to talk to one shard, if the conferences collection is sharded by key and can thus be done efficiently. Obviously, one needs a query optimizer that is aware of the distribution of the shards and the shard keys, but this is certainly doable.
just because the dataset is sharded doesn't mean that one query has to hit every shard. for example, suppose you're looking for documents with `parent_id = foo` and your sharding key is `parent_id`, then an intelligent query planner would only query one shard (the one that "foo" hashes to), and then this looks a lot like a join in an RDBMS. indeed, if you wanted to do (in RDBMS terms) a self-join to load the whole tree of documents rooted at parent_id = foo, and your sharding key were the root for each document, that query would only hit one shard with a. the trick is deciding which keys to shard on (and, in many cases, what other keys to shard on in redundant datastores that serve different types of queries).
Right, you were quicker but are essentially saying the same thing as I said in my example.
Thanks for both your answers, this is really interesting indeed. I always thought that joins are a "no, no, no" in the NoSQL world. This opens up a whole lot of new possibilities. I will have to have a look at this ArangoDB thing...