Hacker News new | ask | show | jobs
by neunhoef 4240 days ago
Interesting article. An obvious reaction is to say: "In a document store, not all joins will be efficient in a sharding situation!". This is true, but certain queries involving joins backed by the right secondary indexes will indeed scale well, therefore one should not use this argument as a reason not to implement joins at all.
1 comments

Can you give an example for a join between two different sharded document collections that can be executed efficiently on say 100 servers?
Say you have one collection for your people (sharded over 100 servers, say) and another one for conferences (also sharded over 100 machines). Then you could hold the primary keys of all conferences a person attended in a JSON list stored in an attribute with the user. A query finding all people with last name "Jones" that have attended a given conference can now be executed efficiently by using a secondary index on the last name of people and performing a key lookup in the conferences collection. The latter only has to talk to one shard, if the conferences collection is sharded by key and can thus be done efficiently. Obviously, one needs a query optimizer that is aware of the distribution of the shards and the shard keys, but this is certainly doable.
just because the dataset is sharded doesn't mean that one query has to hit every shard. for example, suppose you're looking for documents with `parent_id = foo` and your sharding key is `parent_id`, then an intelligent query planner would only query one shard (the one that "foo" hashes to), and then this looks a lot like a join in an RDBMS. indeed, if you wanted to do (in RDBMS terms) a self-join to load the whole tree of documents rooted at parent_id = foo, and your sharding key were the root for each document, that query would only hit one shard with a. the trick is deciding which keys to shard on (and, in many cases, what other keys to shard on in redundant datastores that serve different types of queries).
Right, you were quicker but are essentially saying the same thing as I said in my example.
Thanks for both your answers, this is really interesting indeed. I always thought that joins are a "no, no, no" in the NoSQL world. This opens up a whole lot of new possibilities. I will have to have a look at this ArangoDB thing...