| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by dwenzek 4125 days ago

Interesting but missing a bit of rigor up to being wrong.

If we can compute a sum as a sum of sub-sums, or a count as sum of sub-counts, this is no because addition is commutative `a+b=b+a` (i.e order of operands doesn't matter), but because addition is associative `(a+b)+c=a+(b+c)` (i.e order of operations doesn't matter).

So we can define, `sum(a,b,..,z) = a+b+..+z` whatever is the order of the operations (say `(((a+b)+c)+..+z)` or `(a+(b+(c+...+z)))` or ...). And if we have to compute the sum of two lists xs and ys, we can compute either `sum(append(xs,ys))` or `sum(sum(xs),sum(ys))`.

Likewise, if we can "pull up Collect nodes and push down Computation nodes" in some cases this is not because the involved computation commutes, but because the operation is in some sense compatible with the collection structure and its collect operation.

What we need is an associative operation used to merge the results computed on parts of the collection, so:

  computation(merge_dataset(xs,ys)) = merge_results(computation(xs),computation(ys))

For summation and counting, the merge operation is addition. For filtering the merge operation is simply the former collection collect operation. So we have:

        sum(append(xs,ys)) = sum(sum(xs),sum(ys))
        count(append(xs,ys)) = sum(count(xs),count(ys))
        filter(append(xs,ys)) = append(filter(xs),filter(ys))

The abstract concept behind all this is monoid homomorphims (1) and, if you are looking for further readings, I wrote a post on how this concept is related to map-reduce and parallelism (2).

- (1) https://en.wikipedia.org/wiki/Monoid#Monoid_homomorphisms.

- (2) http://acidalie.free.fr/unfoldvalue/blog/map-reduce-spirit.h...