For OP's benefit, here are some excerpts from the red book that agree with that premise:

> Google MapReduce set back by a decade the conversation about adaptivity of data in motion, by baking blocking operators into the execution model as a fault-tolerance mechanism. It was nearly impossible to have a reasoned conversation about optimizing dataflow pipelines in the mid-to-late 2000’s because it was inconsistent with the Google/Hadoop fault tolerance model. In the last few years the discussion about execution frameworks for big data has suddenly opened up wide, with a quickly-growing variety of dataflow and query systems being deployed that have more similarities than differences

http://www.redbook.io/ch7-queryoptimization.html

Also see Stonebraker's comment at the bottom here: http://www.redbook.io/ch5-dataflow.html

edit: To be more charitable, MapReduce's main concerns were fault tolerance (and recovery) and massive scalability, at the cost of all else. Since it's so simple, subtasks can die or disappear and you can just respawn them and keep chugging through the query. You also don't have to think too hard about job allocation. It's easy to build and use, and easy to reason about. You can throw more computers at it when you have a spike of jobs, and it scales fairly predictably.

Not many people were really running infrastructure and jobs at the scale Google did, and that's quite different from the traditional "data warehouse" style application, so it wasn't entirely unjustified. The other benefit, of course, is that you can perform arbitrary computation, which is quite different from most RDBMSes, which often don't have great UDF support or are highly restricted and, frankly, horrific to deal with.

Of course, they quickly found that "no query optimization" is an extreme and unproductive position, and that you can have a bit of both as needed.
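To make the "just respawn them" point concrete, here is a minimal single-process sketch in Python. It is not how Google's or Hadoop's implementation works; the names `run_task`, `map_reduce`, `word_count_mapper`, and `sum_reducer` are made up for illustration, and worker death is simulated with a random number generator. The idea it shows is that pure tasks can simply be rerun from scratch, and that the grouping step blocks until every mapper has finished, which is the blocking behaviour the red book quote complains about.

```python
import random
from collections import defaultdict


def run_task(task, arg, rng, failure_rate, max_attempts=50):
    """Run a pure task, rerunning it from scratch when the simulated worker dies."""
    for _ in range(max_attempts):
        if rng.random() < failure_rate:
            continue  # worker died: there is no partial state to recover, so respawn
        return task(arg)
    raise RuntimeError("task failed on every attempt")


def map_reduce(chunks, mapper, reducer, failure_rate=0.3, seed=0):
    rng = random.Random(seed)  # seeded so the simulated failures are repeatable
    # Map phase: every chunk is an independent, restartable task.
    mapped = [run_task(mapper, chunk, rng, failure_rate) for chunk in chunks]
    # Shuffle: a blocking step, grouping starts only after all mappers are done.
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    # Reduce phase: one restartable task per key.
    return {
        key: run_task(reducer, (key, values), rng, failure_rate)
        for key, values in groups.items()
    }


def word_count_mapper(chunk):
    return [(word, 1) for word in chunk.split()]


def sum_reducer(item):
    _key, values = item
    return sum(values)


counts = map_reduce(["a b a", "b c"], word_count_mapper, sum_reducer)
print(counts)
```

Because the tasks have no side effects, the answer is the same whether or not any attempt fails; failures only cost time.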
https://github.com/frankmcsherry/blog/blob/master/posts/2017...
I don't use databases because they are really quite bad at computation.
In my opinion, the main recent novelty in query planning has been the work on worst-case optimal joins, stuff like EmptyHeaded[1] and the recent FAQ work[2].
[1]: https://arxiv.org/abs/1503.02368
[2]: https://arxiv.org/abs/1504.04044
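For anyone who hasn't seen the idea, here is a rough Python sketch of the attribute-at-a-time style of join that this line of work builds on, applied to the triangle query Q(a, b, c) :- R(a, b), S(b, c), T(a, c). It is only an illustration under my own assumptions, not EmptyHeaded's or the FAQ paper's actual algorithm; `index`, `intersect`, and `triangles` are hypothetical helper names. Instead of joining two relations and then filtering against the third, it binds one variable at a time and intersects the candidate sets from every relation that mentions that variable.

```python
from collections import defaultdict


def index(pairs):
    """Hash index: first column -> set of second-column values."""
    idx = defaultdict(set)
    for x, y in pairs:
        idx[x].add(y)
    return idx


def intersect(xs, ys):
    """Intersect by scanning the smaller collection and probing the larger."""
    if len(xs) > len(ys):
        xs, ys = ys, xs
    return [x for x in xs if x in ys]


def triangles(R, S, T):
    """Q(a, b, c) :- R(a, b), S(b, c), T(a, c), binding one variable at a time."""
    r, s, t = index(R), index(S), index(T)
    out = []
    for a in intersect(r, t):              # a must appear in both R and T
        for b in intersect(r[a], s):       # b must follow a in R and start a pair in S
            for c in intersect(s[b], t[a]):  # c must close the triangle
                out.append((a, b, c))
    return sorted(out)


R = [(1, 2), (1, 3), (2, 3)]
S = [(2, 3), (3, 4)]
T = [(1, 3), (1, 4), (2, 4)]
print(triangles(R, S, T))
```

A pairwise plan would first materialize R joined with S, which can be much larger than the final output; intersecting per variable avoids building that intermediate result.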