| HN Mirror

by tjhunter 881 days ago

(2nd user & developer of spark here). It depends on what you ask.

MapReduce the framework is proprietary to Google, and some pipelines are still running inside Google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you still see that in the description of its operations (exchange, collect). However, Spark and all the other modern frameworks realized that:

- users did not care about mapping and reducing; they wanted higher-level primitives (filtering, joins, ...)

- MapReduce was great for one-shot batch processing of data, but struggled to accommodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do it on top of MapReduce, but if you really start tuning for the specific case, you end up with something rather different. For example, kafka (scalable streaming engine) is inspired by the general principles of MR but the use cases and APIs are now quite different.
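
To make the "higher level primitives" point concrete, here is a minimal single-process sketch of an inner join built only from a map step (tagging records), a shuffle (grouping by key), and a second map over each group. The names `shuffle`, `join`, `users`, and `orders` are hypothetical and chosen for illustration; this is not how Spark or Hadoop implement joins, just the general shape of a reduce-side join.

```python
from collections import defaultdict
from itertools import product


def shuffle(pairs):
    """Group values by key (hypothetical helper, runs in one process)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def join(left, right):
    """Inner join of two lists of (key, value) pairs via map + shuffle + map."""
    # Map: tag every record with the side it came from.
    tagged = [(k, ("L", v)) for k, v in left] + [(k, ("R", v)) for k, v in right]
    joined = []
    # Shuffle, then map over each key's group to pair left and right values.
    for key, values in shuffle(tagged).items():
        lefts = [v for side, v in values if side == "L"]
        rights = [v for side, v in values if side == "R"]
        joined.extend((key, l, r) for l, r in product(lefts, rights))
    return joined


users = [(1, "ann"), (2, "bob")]
orders = [(1, "book"), (1, "pen"), (3, "cup")]
print(join(users, orders))  # [(1, 'ann', 'book'), (1, 'ann', 'pen')]
```

The user only wants to say "join these two datasets"; the tagging and grouping are plumbing that a higher-level API can hide.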

3 comments

dataflow 881 days ago

Also see https://news.ycombinator.com/item?id=37313576

H8crilA 881 days ago

There really was always only Map and Shuffle (Reduce is just Shuffle+Map; also another name for Shuffle is GroupByKey). And you see those primitives under the hood of most parallel systems.

refulgentis 881 days ago

Shuffle is interesting, I gotta read up on that. Maybe I've been hearing "reduce" for too long and have too much of a built-in visual sense of it, but "shuffle" does not seem like the right name at all: it makes me picture randomizing some set of N items, where the input and output counts are the same.

H8crilA 880 days ago

Shuffle is an operation that converts "{k1, v1}, {k1, v2}, {k2, v3}" into "{k1, [v1, v2]}, {k2, [v3]}".
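
As a rough illustration of that transformation, here is a minimal in-memory sketch. The function name `shuffle` is hypothetical, and a real distributed shuffle also moves data between machines, which this does not attempt to model.

```python
from collections import defaultdict


def shuffle(pairs):
    """Group values by key, i.e. GroupByKey on a list of (key, value) pairs."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


pairs = [("k1", "v1"), ("k1", "v2"), ("k2", "v3")]
print(shuffle(pairs))  # {'k1': ['v1', 'v2'], 'k2': ['v3']}
```

Note that three input records become two output groups, so the input and output counts generally differ.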

lupire 881 days ago

Reduce is useful for aggregate metrics.

H8crilA 880 days ago

My point is that Reduce is Shuffle+Map, without materializing the intermediate result (the result after Shuffle).
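
A small sketch of that equivalence, assuming a sum as the aggregate: one version materializes the shuffled groups and then maps over them, the other folds each value into a per-key accumulator as it arrives. The function names are hypothetical, and this is a single-process illustration rather than any framework's implementation.

```python
from collections import defaultdict
from functools import reduce
import operator


def shuffle(pairs):
    """Group values by key (materializes every group in memory)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(groups)


def reduce_via_shuffle_map(pairs, fn):
    """Shuffle first, then map each (key, values) group to one aggregate."""
    return {key: reduce(fn, values) for key, values in shuffle(pairs).items()}


def reduce_streaming(pairs, fn):
    """Same result, but only one accumulator per key is ever stored."""
    acc = {}
    for key, value in pairs:
        acc[key] = fn(acc[key], value) if key in acc else value
    return acc


pairs = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]
print(reduce_via_shuffle_map(pairs, operator.add))  # {'a': 9, 'b': 6}
print(reduce_streaming(pairs, operator.add))        # {'a': 9, 'b': 6}
```

The streaming version never builds the intermediate `[v1, v2, ...]` lists, which is the "without materializing" part.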

VirusNewbie 881 days ago

> For example, kafka (scalable streaming engine) is inspired by the general principles of MR but the use cases and APIs are now quite different.

Are you confusing Kafka with something else? Kafka is a persistent, append-only write queue.
