Even with Hadoop, there are still algorithmic decisions you have to make, like whether to choose a stripes or pairs strategy when synchronization is needed [1], or which data structures suit real-time queries [2]. Sure, something like Hive or Pig might work well enough for certain queries, but for the more complex ones where you need to write bare MapReduce or Spark, these data structure and algorithm concerns pop up quite fast.
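To make the pairs vs. stripes choice concrete, here is a rough single-process sketch using word co-occurrence counting, which is the usual textbook setting for it. It isn't taken from [1]; the function names are made up for illustration, and a real job would run the map and reduce steps on a cluster rather than in one Python process.

```python
from collections import Counter, defaultdict
from itertools import permutations

def pairs_map(doc):
    """Pairs: emit one ((w, u), 1) record per ordered co-occurrence."""
    for w, u in permutations(doc, 2):
        yield (w, u), 1

def stripes_map(doc):
    """Stripes: emit one (w, {u: count}) record per word position."""
    for i, w in enumerate(doc):
        # Everything in the document except the word at position i.
        yield w, Counter(doc[:i] + doc[i + 1:])

def reduce_pairs(records):
    """Sum the counts for each (w, u) key."""
    totals = Counter()
    for key, n in records:
        totals[key] += n
    return totals

def reduce_stripes(records):
    """Merge the stripes for each w by element-wise addition."""
    totals = defaultdict(Counter)
    for w, stripe in records:
        totals[w].update(stripe)
    return totals

docs = [["a", "b", "c"], ["a", "b"]]
pairs = reduce_pairs(r for d in docs for r in pairs_map(d))
stripes = reduce_stripes(r for d in docs for r in stripes_map(d))
```

Both produce the same counts. The trade-off is that pairs emits many tiny records (more shuffle traffic, almost no memory per record), while stripes emits fewer, larger records that have to fit in memory on the mapper and reducer.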
MapReduce is filled with algorithms. The core system is basically a merge sort engine (the shuffler). In fact, if you ever wondered about the question "sort 2M integers in 1M RAM", well, that was Jeff asking people whether they understood that the shuffle was out-of-core.
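For anyone who hasn't seen the out-of-core trick, a minimal sketch of an external merge sort looks something like the following. This is only an illustration of the general idea, not how the actual shuffler is written; `external_sort`, `_spill`, and `chunk_size` are names invented here, and `chunk_size` stands in for "how many integers fit in RAM".

```python
import heapq
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temp file, one integer per line."""
    f = tempfile.TemporaryFile(mode="w+")
    f.writelines(f"{x}\n" for x in sorted_chunk)
    f.seek(0)
    return f

def external_sort(numbers, chunk_size):
    """Yield the integers in sorted order, buffering at most chunk_size inputs."""
    runs, chunk = [], []
    for x in numbers:
        chunk.append(x)
        if len(chunk) >= chunk_size:
            runs.append(_spill(sorted(chunk)))  # sort in memory, spill to disk
            chunk = []
    if chunk:
        runs.append(_spill(sorted(chunk)))
    # Stream every run back and do a k-way merge; only one value per run
    # needs to be held in memory at a time.
    streams = [(int(line) for line in f) for f in runs]
    try:
        yield from heapq.merge(*streams)
    finally:
        for f in runs:
            f.close()

result = list(external_sort([5, 3, 1, 4, 2, 0], chunk_size=2))
```

With 2M integers and room for roughly 1M of them, you'd get a handful of sorted runs on disk and one merge pass over them.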
Another example is lexicographic range sharding, which uses reservoir sampling to compute optimal tablet key split points by doing a constant-space-and-time heuristic sampling over the keyspace.
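Roughly, the sampling idea can be sketched like this: keep a fixed-size uniform sample of the keys as they stream past (classic reservoir sampling), sort the sample, and read split points off its quantiles. The function names, the default `sample_size`, and the quantile rule are all assumptions for illustration, not the real tablet-splitting code.

```python
import random

def reservoir_sample(keys, k, rng=random):
    """Uniform sample of up to k keys from a stream, using O(k) memory."""
    sample = []
    for i, key in enumerate(keys):
        if i < k:
            sample.append(key)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = key    # keep the new key with probability k/(i+1)
    return sample

def split_points(keys, num_shards, sample_size=1000, rng=random):
    """Pick num_shards - 1 keys that cut the sampled keyspace into even ranges."""
    sample = sorted(reservoir_sample(keys, sample_size, rng))
    if not sample:
        return []
    return [sample[len(sample) * i // num_shards] for i in range(1, num_shards)]

# Fewer keys than sample_size, so the sample is the whole input.
points = split_points(["a", "b", "c", "d", "e", "f", "g", "h"], num_shards=4)
```

The memory used depends only on `sample_size`, not on how many keys there are, and the per-key work is constant. The split points are only as good as the sample, which is where the "heuristic" part comes in.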
I used to think MR was just brute force, but it has many levels of algorithms. Probably too many: at some point it became hard to analyze how the system worked because of the various kinds of hedging and recovery strategies.