
by pixelmonkey 2194 days ago

Michael Stonebraker has an interesting set of conclusions in his assessment of the MapReduce vendor market in 2015 from the "Dataflow" chapter here:

"- Just because Google thinks something is a good idea does not mean you should adopt it.

- Disbelieve all marketing spin, and figure out what benefit any given product actually has. This should be especially applied to performance claims.

- The community of programmers has a love affair with “the next shiny object”. This is likely to create “churn” in your organization, as the “half-life” of shiny objects may be quite short."

4 comments

sradman 2193 days ago

I reread DeWitt and Stonebraker’s (D&S) MapReduce criticism [1] and I still find it misguided 12 years later.

Map() is not equivalent to a SQL GROUP BY clause, it is equivalent to a user-defined Table Function that is used in a FROM clause. This mimics the Extract and Transform stages in a SQL ETL pipeline. The Extract is implied by the input format.

The Reduce() is very much equivalent to a user-defined Aggregate Function. D&S accurately criticize the suboptimal materialization of intermediate data sets, but they underappreciate the implicit input split and distributed sorting mechanism, which dominated the Terasort benchmark at the time (a Jim Gray creation).
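To make the analogy above concrete, here is a minimal single-process sketch in Python. It is not Hadoop's or Google's API: `map_fn`, `reduce_fn`, and `map_reduce` are hypothetical names chosen for illustration, and the word-count task is an assumed example. The map step behaves like a table function (one record in, zero or more rows out), the sort stands in for the framework's implicit shuffle, and the reduce step behaves like an aggregate function applied per key.

```python
from itertools import groupby
from operator import itemgetter


def map_fn(line):
    """Table-function-like step: one input record in, zero or more (key, value) rows out."""
    for word in line.lower().split():
        yield (word, 1)


def reduce_fn(key, values):
    """Aggregate-function-like step: all values for one key in, one result row out."""
    return (key, sum(values))


def map_reduce(records, map_fn, reduce_fn):
    # Map: apply map_fn to every record and flatten the emitted rows.
    intermediate = [row for record in records for row in map_fn(record)]
    # Shuffle: sort by key so equal keys become adjacent
    # (a stand-in for the framework's implicit distributed sort).
    intermediate.sort(key=itemgetter(0))
    # Reduce: call reduce_fn once per group of equal keys.
    return [
        reduce_fn(key, [value for _, value in group])
        for key, group in groupby(intermediate, key=itemgetter(0))
    ]


# Hypothetical input: each string plays the role of one input split record.
lines = ["the quick brown fox", "the lazy dog", "the fox"]
counts = map_reduce(lines, map_fn, reduce_fn)
```

Note that the intermediate list is fully materialized before the reduce step runs, which is the cost D&S criticize; the sort between the two steps is the part the framework provides for free.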

On-premises commodity Hadoop clusters lost out to public Infrastructure-as-a-Service clusters. None of the five takedown categories turned out to be important. The tools have evolved, and cloud-native data warehouses and ETL systems now offer the best of both worlds.

[1] https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_...

mrits 2193 days ago

"Map() is not equivalent to a SQL GROUP BY clause, it is equivalent to a user-defined Table Function that is used in a FROM clause."

No, the projection doesn't remove redundancy under most cases. There also isn't any reason you couldn't have UDFs in the GROUP BY clause. I've written implementations of both, and I think GROUP BY is an excellent comparison for understanding Map in MapReduce systems.

sradman 2193 days ago

> the projection doesn't remove redundancy under most cases

Maybe I'm missing something; I don't understand why projections are part of this discussion. Maybe I should have been more precise. I was thinking about the type of Table UDF that Aster Data made popular around the time DeWitt and Stonebraker wrote their article (Jan 2008). These Table UDFs were written in languages like Java or C/C++ and generally accessed data external to the database engine. Aster Data marketing defined the functionality in terms of MapReduce.

The point I was trying to make was that the Map() part of MapReduce is equivalent to a distributed ETL pipeline. This remains one of the key use cases for Spark. The Reduce() part is no longer relevant in the new world of cheap and scalable column stores. DeWitt and Stonebraker's Teradata-like enterprise data warehouses suffered the same fate.

ramraj07 2193 days ago

When you say cloud-native data warehouses, do you mean things like Snowflake/Redshift/BigQuery or something else? As part of an org making the transition from Spark to these, I can definitely agree that these tools are better suited for practical data engineering at the medium-big-data scale (anything not Google/Facebook).

sradman 2193 days ago

I was thinking AWS Athena (Presto) for the data warehouse and AWS Glue (Spark) for ETL. Redshift has always had the feel of a Column Store Appliance that runs side-by-side with your other IaaS resources. There is nothing particularly cloud-native about it other than the way it is provisioned and managed in the AWS web Console. Amazon QuickSight seems like an excellent alternative to Enterprise BI pivot tables like Tableau, Excel, PowerPivot, Business Objects, and Cognos. Amazon seems to be ahead of the competition (again) when it comes to ETL/DW/BI-as-a-Service, at least in terms of price-per-performance.

I don't know anything about Snowflake. SQL makes BigQuery and Hive easier to program than MapReduce/Pig but I don't think of these technologies as data warehouses.

Column Stores (compressed bitmap indexes batch updated with an ETL-like process) make exceptional data warehouses. Row-oriented data warehouses all feel like anachronisms now.
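For readers unfamiliar with the bitmap indexes mentioned above, here is a rough sketch in Python of the basic idea, under simplifying assumptions: it uses plain Python integers as uncompressed bitsets (real column stores compress them), builds each index in one batch pass, and the column names and data are made up for illustration. `build_bitmap_index` and `rows_from_bitmap` are hypothetical helper names, not any product's API.

```python
from collections import defaultdict


def build_bitmap_index(column):
    """Batch-build {value: bitmap}, where bit i is set when column[i] == value."""
    index = defaultdict(int)
    for row_id, value in enumerate(column):
        index[value] |= 1 << row_id
    return dict(index)


def rows_from_bitmap(bitmap):
    """Decode a bitmap back into the sorted list of row ids whose bit is set."""
    rows = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            rows.append(row_id)
        bitmap >>= 1
        row_id += 1
    return rows


# Hypothetical columns of a six-row fact table.
region = ["east", "west", "east", "north", "west", "east"]
status = ["open", "open", "closed", "open", "closed", "open"]

region_idx = build_bitmap_index(region)
status_idx = build_bitmap_index(status)

# WHERE region = 'east' AND status = 'open' becomes a bitwise AND of two bitmaps.
matching = rows_from_bitmap(region_idx["east"] & status_idx["open"])
```

The point of the design is that a multi-column filter reduces to bitwise operations over per-value bitmaps, without touching the rows themselves.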

jrumbut 2194 days ago

I think it's a bit of a shame that the MapReduce concept got the shiny object treatment since I thought it was a nice pragmatic approach to a useful set of problems that are faced all the time and often addressed with ad-hoc programs that make a mess.

People always looked down on those that used Hadoop or somesuch for <1GB of data, but while it wasn't needed from a technology perspective it gave a structure to the project.

Now many places are back in the world of one-off scripts, and I think something of value was lost (even if it was a little ridiculous to fire up a cluster for something Excel or SQLite could handle).

throwaway_pdp09 2193 days ago

> People always looked down on those that used Hadoop or somesuch for <1GB of data, but while it wasn't needed from a technology perspective it gave a structure to the project.

What 'structure'? Why is it so important that it makes it worthwhile firing up a large, complex framework? I'm beyond baffled.

cbcoutinho 2193 days ago

The same 'structure' that makes it easy to onboard new co-workers because they've seen the same project 'structure' before. In that sense, the bottleneck in an organization is getting people productive as fast as possible, even if that means using a cleaver instead of a scalpel.

throwaway_pdp09 2193 days ago

If all they can use is a massive cleaver (big data tools), and have no experience with scalpels (small, sharp, cheap and fast data tools), IMO your company has a serious, fundamental and systemic problem (no, let's call it failure) towards employee experience, training and knowledge. Edit: and resource management.

_jal 2193 days ago

Seems to be a sort of inverse of the massive spreadsheets that run supply chains on accretions of spaghetti-macros.

But, a tree chipper can serve as a paper shredder, and I imagine a lot of shops in certain markets saw it as a sort of prestige asset around 5-8 years back, when a bunch of companies started hiring data scientists for no apparent rational reason.

(Not bashing data scientists or data companies. Just remembering the fad that went around Bay Area companies a while ago.)

o1lab 2193 days ago

>> (even if it was a little ridiculous to fire up a cluster for something Excel or SQLite could handle)

I know the above comment will be lost, but this is such a genuine truth.

ses1984 2193 days ago

I'm sorry, but if your problem can be solved in Excel, then hire people who are good with Excel, not people who are good with Hadoop.

MaysonL 2193 days ago

Things tend to evolve: a system that worked in Excel yesterday may well be a dangerous, unmaintainable monster next week.

exdsq 2193 days ago

> The community of programmers has a love affair with “the next shiny object”. This is likely to create “churn” in your organization, as the “half-life” of shiny objects may be quite short."

This is an interesting thought. A company uses shiny tech because programmers like using it, for whatever reason. This attracts employees who want to use this tech too. The half-life of shiny tech is short, and so these developers move on to shinier pastures. I wonder if this explains why people change jobs so often in tech? I’m sure I read that the average tenure is much lower (~1.5 years) compared to other industries.

dinosaurdynasty 2193 days ago

I'm pretty sure it's the raises people tend to get jumping companies compared to what they get if they stay at a company.

andrewflnr 2193 days ago

If anything I would expect the causality to run the other direction, i.e. resume driven development to make sure they can get a new job and therefore a raise.

atombender 2194 days ago

Well, he's absolutely right. MapReduce apparently didn't last that long at Google — it's long been supplanted by other technologies internally.

century19 2193 days ago

But MapReduce has long been superseded by Spark outside of Google, right?

vikiomega9 2194 days ago

Which ones?

atombender 2194 days ago

As far as I remember, mostly Flume (open sourced as Apache Beam and also known as Cloud Dataflow on GCP).

Flume/Beam still provides map/reduce as operations, mind you, but with a much richer processing model.