| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by atombender 884 days ago
	That's my understanding. MR is very simplistic and awkward/impossible to express many problems in, whereas dataflow processors like Spark and Apache Beam support creating complex DAGs of rich set of operators for grouping, windowing, joining, etc. that you just don't have in MR. You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

1 comments

danpalmer 884 days ago

> You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

I think it's the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.

link

atombender 884 days ago

I don't mean generalization that way. Dataflow operators can be expressed as MR as the underlying primitive, as you say. But MR itself, as described in the original paper at least, only has the two stages, map and reduce; it's not a dataflow system. And it turns out people want dataflow systems, not hand-code MR and do the DAG manually.

link

eru 884 days ago

I'm not sure what you describe is the opposite?

I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that's what your compiler does.

But that doesn't really mean it's useful to think of GOTOs being the generalisation.

Most of the time, it's just the opposite: you can think of a GOTO as a very specific kind of function call, a tail-call without any arguments. See eg https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-...

link