| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by jsteemann 2848 days ago

Thanks! We have a very brief description of the "distributed COLLECT" feature here: https://github.com/arangodb/arangodb/blob/3.4/Documentation/...

More beef to be added to this until the GA release.

The benefits of distributed COLLECT will come into play for queries that can push the aggregate operations onto the shards. Previous versions of ArangoDB shipped all documents from the database servers to the coordinator, so the coordinator would do the central aggregation of the results from all shards to produce the result.

With distributed COLLECT we now create an additional shard-local COLLECT operation that performs part of the aggregation on the shards already. This allows sending just the aggregated per-shard results to the coordinator, so the coordinator can finally perform an aggregation of the per-shard aggregates.

This will be beneficial in many cases when the per-shard aggregated result is much smaller than the non-aggregated per-shard result.

Following is a very simple example. Let's say you have a collection "test" with 5 shards and 500k simple documents that have just one numeric attribute (plus the three system attributes "_key", "_id" and "_rev"):

    db._create("test", { numberOfShards: 5 }); 
    for (i = 0; i < 500000; ++i) {
      db.test.insert({ value: i });
    }

Running a query that will calculate the minimum and maximum values in the "value" attribute can make use of the distributed COLLECT:

    FOR doc IN test 
      COLLECT AGGREGATE min = MIN(doc.value), max = MAX(doc.value) 
      RETURN { min, max }

The database servers can compute the per-shard minimum and maximum values, so they will each only send two numeric values back to the coordinator.

Without the optimization, the database servers will either send the entire documents or a projection of each document (containing just each document's "value" attribute back) to the coordinator. But then each shard would still have to send 100k values on average.

With a local cluster that has 2 database servers and runs them on the same host as the coordinator, this simple query is sped up by a factor of 2 to 3 when the optimization is applied. In a "real" setup the speedup will be even higher because then there will be additional network roundtrips between the cluster nodes. And in reality documents tend to contains more data and collections tend to have more documents. If this is the case, then the speedup will be even higher.