|
|
|
|
|
by sradman
2196 days ago
|
|
I reread DeWitt and Stonebraker’s (D&S) MapReduce criticism [1] and I still find it misguided 12 years later. Map() is not equivalent to a SQL GROUP BY clause, it is equivalent to a user-defined Table Function that is used in a FROM clause. This mimics the Extract and Transform stages in a SQL ETL pipeline. The Extract is implied by the input format. The Reduce() is very much equivalent to a user-defined Aggregate Function. D&S accurately criticize the sub-optimal materialization of intermediate data sets but they under appreciate the implicit input split and distributed sorting mechanism which dominated the Terasort benchmark at the time (a Jim Gray creation). On-Premise commodity Hadoop clusters lost out to public Infrastructure-as-a-Service clusters. None of the five takedown categories turned out to be important. The tools have evolved and cloud-native data warehouses and ETL systems are now the best of both worlds. [1] https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_... |
|
No, the projection doesn't remove redundancy under most cases. There also isn't any reason you couldn't have UDF's in the GROUP BY clause. I've written implementations of both and I think the GROUP BY is an excellent comparison for understanding Map in MapReduce Systems.