by mccanne
811 days ago
Necessity is the mother of invention. MapReduce-based systems were developed because the state-of-the-art RDBMSs of that age could not scale to the needs of the Googles/Yahoos/Facebooks during the phenomenal growth spurt of the early Web. The novelty here was the tradeoffs they made to scale out and up using the compute and storage footprints available at the time. "We thought of that" vs "we built it and made it work".
|
Google built MR because it was in an existential crisis: they couldn't build a new index for the search engine, and freshness and size of the index were important for early search engines. The previous tools would crash part-way through due to the cheap hardware that Google bought. If Google had based search indexing on an RDBMS, they would not exist today.
Now, Google did use an RDBMS: they used MySQL at scale. It wasn't unheard of for mapreduces to run against MySQL (typically doing a query to get a bunch of records, and then mapping over those records).
I worked on later mapreduce (long after it was mature), which used all sorts of tricks to extend the MapReduce paradigm as far as possible, but ultimately nearly everything got replaced with Flume, which is effectively a computational superset of what MR can do.
I think the paper must have been pulled because Stonebraker must have gotten huge pushback for attacking MR for something it wasn't good at. See the original paper for what they proposed as good use cases: counting word occurrences in a large corpus (far larger than the storage limits of postgres and others at the time), distributed grep (without an index), counting unique items (where the number of items is larger than the capacity of a database at the time), reversing a graph (converting (source, target) pairs to (target, [source, source, source])), term vectors, inverted index (the original use case for building the index) and distributed sort. None of the RDBMSs of that day could handle the scale of the web. That's all.
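
To make two of those use cases concrete, here is a rough single-process sketch of the map/shuffle/reduce shape, applied to word counting and graph reversal. It is only an illustration: `map_reduce` and the mapper/reducer names are made up for this example, everything runs in memory on one machine, and it is not Google's API or anything close to the real distributed implementation.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy, in-memory stand-in for a MapReduce run (hypothetical helper)."""
    # Map phase: each record yields zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # "Shuffle": group every emitted value under its key.
            groups[key].append(value)
    # Reduce phase: one reducer call per distinct key.
    return {key: reducer(key, values) for key, values in groups.items()}


# Use case 1: counting word occurrences in a corpus.
def word_count_mapper(line):
    for word in line.split():
        yield word, 1


def word_count_reducer(word, counts):
    return sum(counts)


# Use case 2: reversing a graph, (source, target) -> (target, [sources]).
def reverse_edge_mapper(edge):
    source, target = edge
    yield target, source


def collect_sources_reducer(target, sources):
    return sorted(sources)


word_counts = map_reduce(["a b a", "b c"], word_count_mapper, word_count_reducer)
reversed_graph = map_reduce(
    [("x", "y"), ("z", "y"), ("x", "z")],
    reverse_edge_mapper,
    collect_sources_reducer,
)
```

The point of the shape is that the mapper and reducer only ever see one record or one key at a time, so the real system can spread them over thousands of cheap machines and rerun the pieces that crash.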