by tjhunter 881 days ago
(2nd user & developer of spark here). It depends on what you ask.

MapReduce the framework is proprietary to Google, and some pipelines are still running on it inside Google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:

- users did not care about mapping and reducing, they wanted higher-level primitives (filtering, joins, ...)

- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do these on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, Kafka (scalable streaming engine) is inspired by the general principles of MR but the use cases and APIs are now quite different.
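
To make the first point concrete, here is a rough sketch of what an inner join looks like when you only have map and group-by-key to work with. This is plain Python, not Spark or Hadoop code; `shuffle`, `join`, and the sample data are hypothetical names made up for illustration.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key (a local stand-in for a distributed shuffle)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def join(left, right):
    """Inner join of two lists of (key, value) pairs using only map and shuffle."""
    # Map step: tag every record with the side it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    # Shuffle step: bring all records with the same key together.
    grouped = shuffle(tagged)
    # Map step over groups: pair every left value with every right value.
    out = []
    for key, values in grouped.items():
        lefts = [v for side, v in values if side == "L"]
        rights = [v for side, v in values if side == "R"]
        out.extend((key, l, r) for l in lefts for r in rights)
    return out

users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "lamp")]
joined = join(users, orders)
print(joined)  # [(1, 'ann', 'book'), (1, 'ann', 'pen')]
```

Most users would rather write a one-line join and let the framework pick the strategy than hand-roll the tagging and grouping above.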

3 comments

There really was always only Map and Shuffle (Reduce is just Shuffle+Map; also another name for Shuffle is GroupByKey). And you see those primitives under the hood of most parallel systems.
Shuffle is interesting, I gotta read up on that. Maybe I've been hearing reduce for too long and have too much of a built-in visual sense of it, but... shuffle does not seem like the right name at all. It makes me picture randomizing some set of N items, where the input and output counts are the same.
Shuffle is an operation that converts "{k1, v1}, {k1, v2}, {k2, v3}" into "{k1, [v1, v2]}, {k2, [v3]}".
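
A minimal single-machine sketch of that conversion, assuming the pairs fit in memory (a real shuffle moves data between machines; the function name `shuffle` here is just illustrative):

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key, keeping values in arrival order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

pairs = [("k1", "v1"), ("k1", "v2"), ("k2", "v3")]
grouped = shuffle(pairs)
print(grouped)  # {'k1': ['v1', 'v2'], 'k2': ['v3']}
```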
Reduce is useful for aggregate metrics.
My point is that Reduce is Shuffle+Map, without materializing the intermediate result (the result after Shuffle).
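
A small Python sketch of that equivalence, under the assumption that everything runs in one process. `reduce_via_shuffle` builds the grouped lists first and then maps over them; `reduce_streaming` folds each value into a per-key accumulator so the grouped lists never exist. Both function names are made up for this example.

```python
from collections import defaultdict

def shuffle(pairs):
    """Group (key, value) pairs by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)

def reduce_via_shuffle(pairs, fn):
    """Shuffle, then map fn over each group's value list (lists are materialized)."""
    return {key: fn(values) for key, values in shuffle(pairs).items()}

def reduce_streaming(pairs, combine, initial):
    """Fold values into a per-key accumulator; no per-key list is ever built."""
    acc = {}
    for key, value in pairs:
        acc[key] = combine(acc.get(key, initial), value)
    return acc

pairs = [("k1", 1), ("k1", 2), ("k2", 3)]
materialized = reduce_via_shuffle(pairs, sum)
streamed = reduce_streaming(pairs, lambda a, b: a + b, 0)
print(materialized, streamed)  # {'k1': 3, 'k2': 3} {'k1': 3, 'k2': 3}
```

The streaming form only works when the function can be expressed as an incremental combine step (sum, max, count), which is the aggregate-metrics case.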
> For example, kafka (scalable streaming engine) is inspired by the general principles of MR but the use cases and APIs are now quite different.

Are you confusing Kafka with something else? Kafka is a persistent, append-only queue.