| HN Mirror


by alextp 5149 days ago

Mapreduce as a concept goes beyond lisp implementations. On the surface it might seem like the point of mapreduce is expressing computations in terms of map and reduce functions. It isn't.

The point of mapreduce is reducing the problem of building high-throughput, fault-tolerant distributed systems to a very efficient and reliable distributed sorting algorithm (the shuffle phase, which is provided by the mapreduce framework and not by the user code). If you can express all synchronization in your algorithm in terms of sorting, then whatever you do before sorting (map) or after it (reduce) is kind of trivial, as the hard part is taken care of by the framework.
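To make that concrete, here is a rough single-process sketch in Python of a job where the only step that looks at all the data at once is a sort by key. The names (`map_phase`, `shuffle_phase`, `reduce_phase`, `run_job`) and the word-count job are made up for illustration and are not any real framework's API; in an actual system the sort would be the distributed, fault-tolerant part.

```python
from itertools import groupby
from operator import itemgetter


def map_phase(records, map_fn):
    # Every record is mapped independently, so no coordination is needed here.
    return [pair for record in records for pair in map_fn(record)]


def shuffle_phase(pairs):
    # Stand-in for the framework's shuffle: sorting by key brings equal keys
    # next to each other, which is the only "global" step in the whole job.
    ordered = sorted(pairs, key=itemgetter(0))
    return [
        (key, [value for _, value in group])
        for key, group in groupby(ordered, key=itemgetter(0))
    ]


def reduce_phase(grouped, reduce_fn):
    # Each key is reduced independently of every other key.
    return [(key, reduce_fn(key, values)) for key, values in grouped]


def run_job(records, map_fn, reduce_fn):
    return reduce_phase(shuffle_phase(map_phase(records, map_fn)), reduce_fn)


# Hypothetical user code: a word count.
def word_count_map(line):
    return [(word, 1) for word in line.split()]


def word_count_reduce(word, counts):
    return sum(counts)


if __name__ == "__main__":
    print(run_job(["a b a", "b c"], word_count_map, word_count_reduce))
```

Note that the user-supplied functions never see more than one record or one key at a time; everything that needs a view across machines is confined to `shuffle_phase`.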

This abstraction is novel, and profoundly useful, and that's the point of mapreduce, not so much the actual map() and reduce() functions.

1 comments

fa_il 5149 days ago

Sorry, but applying an old concept to a new problem (actually just new buzzwords... it's only the size of the problem that's new) does not make a "novel" solution. Moreover, it's an obvious solution. But I guess that depends on who is doing the programming.

I would love to see how programmers with large clusters at their disposal were approaching large datasets before the moment they realized splitting the task into smaller pieces was what they should do.

alextp 5148 days ago

It's not about splitting the task into smaller pieces. It's about factoring out the parts of the task that need synchronization among all machines into one specific subroutine (groupBy). That is what makes mapreduce so powerful.

If you speak with people experienced in multithreaded and distributed programming you will see that synchronization with fault-tolerance is _hard_, and mapreduce provides a widely-applicable set of sufficient conditions for an algorithm to be executable with implicit fault-tolerance and implicit synchronization.

Without mapreduce-like abstractions, every piece of software has to be responsible for its own (1) checkpointing (to recover from errors), (2) checksumming (to ensure that no errors happened), and (3) distributed communication (to make sure the global state becomes global and the local state becomes local).
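As a rough illustration of items (1) and (2), here is roughly what a single worker ends up carrying when it does this by hand. This is only a sketch under assumptions: the file layout, the function names (`save_checkpoint`, `load_checkpoint`, `sum_with_checkpoints`) and the `fail_at` crash simulation are all invented for the example, and item (3), the communication between machines, is left out entirely.

```python
import hashlib
import json
import os
import tempfile


def checksum(payload: bytes) -> str:
    return hashlib.sha256(payload).hexdigest()


def save_checkpoint(path, state):
    # Store the state together with a checksum of its canonical encoding.
    payload = json.dumps(state, sort_keys=True).encode("utf-8")
    record = {"sha256": checksum(payload), "state": state}
    # Write to a temporary file, then rename it into place, so a crash
    # mid-write does not leave a half-written checkpoint at `path`.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    # Return the saved state, or None if it is missing or fails verification.
    if not os.path.exists(path):
        return None
    with open(path) as handle:
        try:
            record = json.load(handle)
        except json.JSONDecodeError:
            return None
    payload = json.dumps(record["state"], sort_keys=True).encode("utf-8")
    if checksum(payload) != record["sha256"]:
        return None
    return record["state"]


def sum_with_checkpoints(numbers, path, every=2, fail_at=None):
    # Resume from the last verified checkpoint, if there is one.
    state = load_checkpoint(path) or {"next_index": 0, "total": 0}
    for i in range(state["next_index"], len(numbers)):
        if fail_at is not None and i == fail_at:
            raise RuntimeError("simulated crash")  # hypothetical failure
        state["total"] += numbers[i]
        state["next_index"] = i + 1
        if state["next_index"] % every == 0:
            save_checkpoint(path, state)
    save_checkpoint(path, state)
    return state["total"]
```

All of this bookkeeping exists just to sum a list on one machine; none of it is about the actual computation.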