Ask HN: Does (or why does) anyone use MapReduce anymore? | HN Mirror

Y	Hacker News new \| ask \| show \| jobs

	Ask HN: Does (or why does) anyone use MapReduce anymore?
	106 points by bk146 875 days ago
	Excluding the Hadoop ecosystem, I see some references to MapReduce in other database and analysis tools (e.g., MatLab). My perception was that Spark completely superseded MapReduce. Are there just different implementations of MapReduce and the one that Hadoop implemented was replaced by Spark?

15 comments

tjhunter 875 days ago

(2nd user & developer of spark here). It depends on what you ask.

MapReduce the framework is proprietary to Google, and some pipelines are still running inside google.

MapReduce as a concept is very much in use. Hadoop was inspired by MapReduce. Spark was originally built around the primitives of MapReduce, and you see still see that in the description of its operations (exchange, collect). However, spark and all the other modern frameworks realized that:

- users did not care mapping and reducing, they wanted higher level primitives (filtering, joins, ...)

- mapreduce was great for one-shot batch processing of data, but struggled to accomodate other very common use cases at scale (low latency, graph processing, streaming, distributed machine learning, ...). You can do it on top of mapreduce, but if you really start tuning for the specific case, you end up with something rather different. For example, kafka (scalable streaming engine) is inspired by the general principles of MR but the use cases and APIs are now quite different.

dataflow 875 days ago

Also see https://news.ycombinator.com/item?id=37313576

H8crilA 875 days ago

There really was always only Map and Shuffle (Reduce is just Shuffle+Map; also another name for Shuffle is GroupByKey). And you see those primitives under the hood of most parallel systems.

refulgentis 875 days ago

Shuffle is interesting, I gotta read up on that. Maybe I've been hearing reduce for too long and have too much of a built-in visual sense of it but...shuffle does not seem like the right name at all, then I picture randomizing some set N, where the input and output counts are the same.

H8crilA 874 days ago

Shuffle is an operation that converts "{k1, v1}, {k1, v2}, {k2, v3}" into "{k1, [v1, v2]}, {k2, [v3]}".

lupire 875 days ago

Reduce is useful for aggregate metrics.

H8crilA 874 days ago

My point is that Reduce is Shuffle+Map, without materializing the intermediate result (the result after Shuffle).

VirusNewbie 875 days ago

For example, kafka (scalable streaming engine) is inspired by the general principles of MR but the use cases and APIs are now quite different.

Are you confusing kafka with something else? Kafka is a persistent write append queue.

dtoma 875 days ago

The "streaming systems" book answers your question and more: https://www.oreilly.com/library/view/streaming-systems/97814.... It gives you a history of how batch processing started with MapReduce, and how attempts at scaling by moving towards streaming systems gave us all the subsequent frameworks (Spark, Beam, etc.).

As for the framework called MapReduce, it isn't used much, but its descendant https://beam.apache.org very much is. Nowadays people often use "map reduce" as a shorthand for whatever batch processing system they're building on top of.

nwsm 875 days ago

This book looks interesting, should I buy it or does anyone else have newer recommendations? I have Designing Data-Intensive Applications which is a fantastic overview and still holds up well.

erikerikson 875 days ago

That was one of "the" books in the space prior to DDIA. In my opinion Akidao mixes the logic for processing events with the stream infrastructure implementation because he was writing from the context of his particular use cases. The time that I spoke with him it seemed that his influence had driven to the design of Google's systems and GCP such that they didn't properly prioritize ordering/linearity/consistency requirements. At this point my copy is of historic interest to me.

62951413 875 days ago

It has the most interesting/conceptual/detailed discussion of the streaming system semantics (e.g. interplay of windows and stateful stream operations) I'm aware of to this day. At least as far as Manning/O'Reilly-level books go. So I'd put it on the same bookshelf as DDIA.

It's a little biased towards Beam and away from Spark/Flink though. Which makes it less practical and more conceptual. So as long as it's your cup of tea go for it.

bk146 875 days ago

Thank you, I'll check this out!

throwaway5959 875 days ago

I feel like that’s kind of like saying we don’t use Assembly anymore now that we have C. We’ve just built higher level abstractions on top of it.

falcor84 875 days ago

Yeah, that's exactly how I read the question, i.e. analogous to "does anyone still code in assembly, or has everyone switched to using abstractions?" and I think it's a very interesting one.

KaiserPro 875 days ago

Its more like does anyone use goto.

Why use map:reduce when you can have an entire DAG for fanout/in?

dehrmann 875 days ago

At a high level, most distributed data systems look something like MapReduce, and that's really just fancy divide-and-conquer. It's hard to reason about, and most data at this size is tabular, so you're usually better off using something where you can write a SQL query and let the query engine do the low-level map-reduce work.

BenoitP 875 days ago

The concept is quite alive, and the fancy deep learning have it: jax.lax.map, jax.lax.reduce.

It's going to stay because it is useful:

Any operation that you can express with an associative behavior is automatically parallelizeable. And both in Spark and Torch/Jax this means scalable to a cluster, with the code going to the data. This is the unfair advantage of solving bigger problems.

If you were talking about the Hadoop ecosystem, then yes Spark pretty much nailed it and is dominant (no need to have another implementation)

atombender 875 days ago

That's my understanding. MR is very simplistic and awkward/impossible to express many problems in, whereas dataflow processors like Spark and Apache Beam support creating complex DAGs of rich set of operators for grouping, windowing, joining, etc. that you just don't have in MR. You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

danpalmer 875 days ago

> You can do MR within a DAG, so you could say that dataflows are a generalization or superset of the MR model.

I think it's the opposite of this. MapReduce is a very generic mechanism for splitting computation up so that it can be distributed. It would be possible to build Spark/Beam and all their higher level DAG components out of MapReduce operations.

atombender 875 days ago

I don't mean generalization that way. Dataflow operators can be expressed as MR as the underlying primitive, as you say. But MR itself, as described in the original paper at least, only has the two stages, map and reduce; it's not a dataflow system. And it turns out people want dataflow systems, not hand-code MR and do the DAG manually.

eru 875 days ago

I'm not sure what you describe is the opposite?

I mean, you can implement function calls (and other control flow operators like exceptions or loops) as GOTOs and conditional branches, and that's what your compiler does.

But that doesn't really mean it's useful to think of GOTOs being the generalisation.

Most of the time, it's just the opposite: you can think of a GOTO as a very specific kind of function call, a tail-call without any arguments. See eg https://www2.cs.sfu.ca/CourseCentral/383/havens/pubs/lambda-...

dikei 875 days ago

MapReduce was basically a very verbose/imperative way to perform scalable, larger than memory aggregate-by-key operation.

It was necessary as a first step, but as soon as we had better abstraction, everyone stopped using it directly except for legacy maintenance of course.

lupire 875 days ago

The abstraction came first. MapReduce was quickly used as a basis for larger-than-machine SQL (Google Dremel and Hadoop Pig). MapReduce was separately useful when the processing pieces require a lot of custom code that doesn't fit well into SQL (because you have hierarchical records, not purely relational, for example)

DeathArrow 875 days ago

Can you point, please, to the better abstractions?

willvarfar 875 days ago

SQL comes to mind.

Every time you run an SQL query on BigQuery, for example, you are executing those same fundamental map shuffle primitives on underlying data, it's just that the interface is very different.

qoega 875 days ago

Now you rarely use basic MapReduce primitives, you have another layer of abstraction that can run on infrastructure that was running MR jobs before. This infrastructure allows to efficiently allocate some compute resources for "long" running tasks in a large cluster with respect to memory/cpu/network and other constraints. So basically schedulers of MapReduce jobs and cluster management tools became that good, because MR methodology had trivial abstractions, but required efficient implementation to make it work seamlessly.

Abstraction layers on top of this infrastructure now can optimize pipeline as a whole by merging several steps into one when possible, add combiners(partial reduce before shuffle). It requires whole processing pipeline to be defined in more specific operations. Some of them propose to use SQL to formulate task, but it can be done using other primitives. And given this pipeline it is easy to implement optimizations making whole system much more user-friendly and efficient compared to MapReduce, when user has to think about all the optimizations and implement them inside single map/reduce/(combine) operations.

monero-xmr 875 days ago

The correct language for querying data is, as always, SQL. No one cares about the implementation details.

“I have data and I know SQL. What is it about your database that makes retrieving it better?”

Any other paradigm is going to be a niche at best, likely outright fail.

esafak 875 days ago

Spark is really failing, all right.

SQL lacks type safety, testability, and composability.

monero-xmr 875 days ago

It’s crazy to think how old I am now. But give it 20 more years and you’ll come around.

ramses0 875 days ago

Agree/disagree. I wish (and maybe this is "the answer") that you could take a basic SQL query and "invert it" into composable components.

A common thing I ended up doing for some "small data" hack projects was extremely liberal usage of SQLite: SELECT ... UNION ( SELECT ... ... GROUP BY ... ( UNION ... etc ) ) ... absolutely terrible SQL, but it got the job done to return the 100 or so records I was interested in.

It'd be great if I could write me some SQL then pop it out into: fn_group001, fn_join(g1, g2, cols(c1, c2)), ...etc...

...and then have composable sub-components of what the janky SQL-COBOL syntax supports, but in a group().chain().join(...) style.

I think I keep running across DataLog as something that's recommended, and of course ProLog has some similarities.

Nothing has been compelling enough to warrant jumping off of SQL, but I really do agree with the grandparent comment: SQL (aka: COBOL) is pretty clunky and non-composable in a way that complicates what you'd think would be straightforward for interactive, non-programming usage.

revscat 875 days ago

20 years ago SQL lacked type safety, testability, and composability. Today the same is true. I doubt it will be different 2 decades from now.

SQL is powerful. It is also very old and has very large warts.

api 875 days ago

I think one of the biggest missed opportunities in language design is the integration of powerful relational database and query models directly into a modern language. Not as a bolt-on or an ORM but as a first class part of the language in the same way as maps and arrays. Make a language that deals with data relationally and where relational queries are a core part of the language and relational query execution is baked into its runtime.

If persistence hooks were also baked in then you'd have something a little bit like stored procedures in databases but far more powerful and with a modern syntax. Couple this with a distributed database layer supporting either eventual consistency built on CRDTs or synchronization via raft/paxos and you'd have an amazing application platform.

It's always seemed dumb to me that data, which is in the very center of everything we do, feels like a bolted-on second class citizen from the perspective of pretty much all programming languages and runtime environments. "Oh, you want to work with your data? Well we didn't think about that..." Accessing the data requires weird incantations and hacks that feel like you're entering a 1970s time warp into a PDP-11 mainframe.

Instead the language and runtime environment should be built around the data. Put the data in the center like Copernicus did with the sun.

Why has nobody done this? Has anyone even tried?

monero-xmr 875 days ago

Data access is of primary concern to all programming languages, using either memory or disk. All files on disk can be considered a form of database. Reading and writing files is standard in all languages.

Once you start getting fancier in your files, and the data grows large, you need special ways to read it. A Postgres database can be considered a single big file on disk. It is the Postgres server that is required to access the file in the most efficient way to store and randomly access enormous amounts of general data.

SQLite is interesting in that there is no server, it's just a special library that enables efficient random access of a single file, which can be thought of as a black box that only SQLite knows how to interpret.

Unless you mean, making something like SQL built directly into the language as a first class citizen. Mumps did something like this https://en.wikipedia.org/wiki/MUMPS

mlhpdx 872 days ago

Like language integrated queries in c#?

https://learn.microsoft.com/en-us/dotnet/csharp/linq/

monero-xmr 875 days ago

All data query languages eventually reduce themselves into SQL, or something equivalent to it.

revscat 875 days ago

Ok? Even if that were true — and I’m not entirely sure how you would even prove that — SQL still lacks type safety, testability, and composability.

esafak 875 days ago

So use a better language, and let the compiler optimize it??

esafak 875 days ago

Doubtful, given my decades of experience and expected retirement age.

Working with 1000+-line SQL scripts written by other people is no fun. Why wouldn't you want to decompose that into legible, testable functions using an expressive language like Scala?

blowski 875 days ago

Being old doesn’t automatically make you more right. You don’t get wisdom as a birthday gift.

vladms 875 days ago

But if you did a lot of things you might know what does not work and why. Similar to science articles, nobody talks enough about "X technology does not work for Y domain", so a lot of people try to "reinvent the wheel" only to realize "X does not work for Y". Occasionally there is a surprise (because of some technology advancement) but being old definitely gives you more insight in what can go wrong.

jounker 875 days ago

On the other hand you often gain wisdom with experience, and often experience is proportional to age.

codr7 875 days ago

It does depend on experience, which everyone gets one way or the other.

I don't know about automatically, but definitely more likely.

eropple 875 days ago

At my current job I work with several people who have one to three years of experience repeated over the span of twenty. It might be less likely than some parts of HN would like it to be.

karmajunkie 875 days ago

no, but you do get experience, as in, when last somebody “invented” this idea thirty years ago, it was a total crapshoot, so wonder what’s changed?

oh right, new language. that’ll definitely fix it. :eyeroll:

esafak 875 days ago

So you're still using COBOL, FORTRAN and BASIC, since new languages don't fix things?

nprateem 875 days ago

I know I'm replying to a troll comment, but:

> “I have data and I know SQL. What is it about your database that makes retrieving it better?”

Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.

dijksterhuis 875 days ago

> Because my data comes from a variety of unstructured, possibly dirty sources which need cleaning and transforming before they can be made sense of.

Seattle data guy had a great end of year top 10 memes post recently and one of them went like this

> oh cool you’ve hired a data scientist. so you have a collection of reliable and easy to query data sources, right?

> …

> you do have a collection of reliable and easy to query data sources, right?

—-

Like, most of the time in businesses… if the data can’t be queried with SQL then it’s not ready to be used by the rest of the business. Whether that’s for dashboards, monitoring, downstream analytics or reporting. Data engineers do the dirty data cleaning. Data scientists do the actual science.

That’s what I took from the parent at least.

YMMV obviously depending on your domain. ML being a good example where things like end to end speech-to-text operates on wav files directly.

vismwasm 875 days ago

That's true. With dbt (=SQL+Jinja-Templating in an opionated framework) a large SQL codebase actually becomes maintainable. If in any way possible I'll usually load my raw data in an OLAP table (Snowflake, BigQuery) and do all the transforms there. At least for JSON data that works really well. Combine it with dbt tests and you're safe.

See https://www.getdbt.com/

monero-xmr 875 days ago

It's amazing that you think I'm trolling! The #1 way to get more customers of something as extreme as a new database is to use the tool that potential customers already know and have integrated into their systems. That's SQL. The same logic is for any new paradigm.

Ignore that statement, and fight the uphill battle.

justrealist 875 days ago

You have no idea how long the tail of legacy MR-based daily stat aggregation workflows is in BigCorps.

The batch daily log processor jobs will last longer than Fortran. Longer than Cobol. Longer than earth itself.

yjftsjthsd-h 875 days ago

> The batch daily log processor jobs will last longer than Fortran. Longer than Cobol.

Nonsense... They'll end at the same time. Which is approximately concurrently with the universe.

danvk 875 days ago

Or 2037.

codr7 875 days ago

It definitely played its role in high lighting what most functionish coders already knew: that map/filter/reduce is an awesome model for data processing pipelines.

dehrmann 875 days ago

Huge caveat to this: systems like Hadoop accomplish the parallelization by adding keys, and those FP constructs don't broadly have this. It's better to think of it as parallel divide-and-conquer.

nathants 875 days ago

the idea of map reduce remains a good one.

there are a number of interesting innovations in streaming systems that followed, mostly around reducing latency, reducing batch size, and failure strategies.

even hadoop could be hard to debug when hitting a performance ceiling for challenging workloads. the streaming systems took this even further, spark being notorious for fiddle with knobs and pray the next job doesn’t fail after a few hours, again.

i played around with the thinnest possible distributed data stack a while back[1][2]. i wanted to understand the performance ceiling for different workloads without all the impenetrable layers of software bureaucracy. turns out modern network and cpu are really fast when you stop adding random layers like lasagna.

i think the future of data, for serious workloads, is gonna be bespoke. the primitives are just too good now, and the tradeoff for understandability is often worth the cost.

1. https://github.com/nathants/s4

2. https://github.com/nathants/bsv

dijit 875 days ago

I feel like a lot of the underlying concepts of mapreduce live large in multi-threaded applications even on a single machine.

Its definitely not a dead concept, I guess its not sexy to talk about though.

KaiserPro 875 days ago

From what I recall Map:Reduce was basically a weird representation of DAG based pipeline.

I think because it looked sorta like an automatic dictionary to multi-thread converter it became popular. But its pretty useless unless you know how to split up and process your data.

basically, if you can cut your data up into a queue, you can MapReduce. But, most pipelines are more complex than that, so you probably need a proper DAG with dependencies.

Dopameaner 875 days ago

The paradiagm is very much alive. Look at Query_then_fetch in elasticsearch