Hacker News new | ask | show | jobs
by maweki 1952 days ago
I think the efficiency then hinges on the distribution of your slowness. Is there a slow mapping step for some elements? Then it should be fine having them all near each other. All fast-mapped values before and after can be reduced in the time the slow data is mapped. Only if somehow every second value is slowly mapped you have to wait for reduce to even start.

If your reduce is slow for some inputs, its better to have those inputs near each other, as all consecutive values before and after can be reduced during that time.

And if every step is somewhat constant time, I don't think reordering (that's basically the same as combining random elements instead of consecutive ones) the input does anything.

1 comments

There’s two parts of the efficiency. One is memory: if you have a bunch of partial results hanging around for adjacent elements to reduce with, they will use memory. The other is the time and yes, if you have a map reduce on lists implementation that (unlike the author’s) reduces any two adjacent mapped elements then you’ll only be harmed by more pathological cases like the one I described, but I don’t think it’s that obvious that that case is unlikely (eg let’s say you’re processing some logs. You get one file per day from a system that runs from 9am to 5pm and one from a run from 5pm to 9am. It seems likely that one file would typically take longer than the other to process) And reordering your elements to optimise around this means that you’ll get different results, unless you’re using a commutative monoid! If your monoid isn’t commutative then reordering likely isn’t an option.