Hacker News new | ask | show | jobs
by dan-robertson 1960 days ago
There seems to be a lot of work to get not very much. Maybe the point of this article is more about faffing around with quickcheck but the conclusion should basically be:

1. Parallel [efficient] MapReduce requires a commutative monoid for deterministic results

2. A naive implements of parallel MapReduce in haskell using par doesn’t because it will only do reduce operations in a certain order. Imagine if the map for every second element took 0s and the others took 10s. In normal mapreduce, you would reduce together the fast half of the results as they come in but in this haskell implementation you would need to sit on them and wait until the other adjacent elements come in. In the article’s very naive implementation, you can’t even reduce together the first two elements together until you’ve reduced everything after them (modulo some weirdness around lazy evaluation)

I think if one thinks of mapreduce as an operation on sets not lists, it should be obvious that the monoid must be commutative.

5 comments

I think the main point of the article is that you can use property based testing not as a software quality tool, but as a very quick way to verify/disprove your assumptions. The MapReduce stuff is just a simple enough example to illustrate the point.

I've used property tests the exact same way many times. In complex algorithms assumptions can be hard to think through, but a property testing tool can find counterexamples shockingly quickly.

But that's basically what the conclusion is, isn't it? The author set out with 'obviously the monoid must be commutative', fuzzed it and came to the second point you mention. This all may be very obvious to some, but I found it an interesting read that reminds that one should never fail to question one's assumptions and understanding.
I think that confusion stems from thinking about an operation that ought to be defined on sets, mapreduce, as one defined on lists. Your monoid obviously needs to be commutative for the former and not the latter. And the implementations are pretty similar in serial but quite different in performance in parallel.
3. A day of implementation can save 20minutes of design .
> I think if one thinks of mapreduce as an operation on sets not lists, it should be obvious that the monoid must be commutative.

If anything, you could think of it as an operation on multisets, but not sets. Sets don't have repeated elements.

I think the efficiency then hinges on the distribution of your slowness. Is there a slow mapping step for some elements? Then it should be fine having them all near each other. All fast-mapped values before and after can be reduced in the time the slow data is mapped. Only if somehow every second value is slowly mapped you have to wait for reduce to even start.

If your reduce is slow for some inputs, its better to have those inputs near each other, as all consecutive values before and after can be reduced during that time.

And if every step is somewhat constant time, I don't think reordering (that's basically the same as combining random elements instead of consecutive ones) the input does anything.

There’s two parts of the efficiency. One is memory: if you have a bunch of partial results hanging around for adjacent elements to reduce with, they will use memory. The other is the time and yes, if you have a map reduce on lists implementation that (unlike the author’s) reduces any two adjacent mapped elements then you’ll only be harmed by more pathological cases like the one I described, but I don’t think it’s that obvious that that case is unlikely (eg let’s say you’re processing some logs. You get one file per day from a system that runs from 9am to 5pm and one from a run from 5pm to 9am. It seems likely that one file would typically take longer than the other to process) And reordering your elements to optimise around this means that you’ll get different results, unless you’re using a commutative monoid! If your monoid isn’t commutative then reordering likely isn’t an option.