
by vidarh 1771 days ago

Yes, but when talking about MapReduce we generally talk about distributed frameworks for doing it on big data.

But most people don't have "big data" in the sense of having data that requires more than a single machine to process.

Most people who think they have "big data" still don't have big data (e.g. I've done work on datasets where people insisted on using "big data" solutions when it could all easily fit in a Postgres instance with or without a columnar store with most of the working set cached in memory for a fraction of the cost).

It "went away" in the sense that more people realised they could avoid it with a few simple steps (e.g. pre-processing during ingestion), and/or fit the data they needed on fast-growing individual servers, and so the set of people still using it came to match more closely the set of people who actually work on big data.

For those who actually needed it, it of course never went away.

2 comments

citrin_ru 1771 days ago

If all you have is a hummer everything looks like a nail: when Hadoop first appeared there were almost no other open source systems for processing 'big data', and it was widely adopted. Now there are many options to choose from. We don't have to use map-reduce for every task that could be solved using map-reduce. E.g. for some tasks a columnar store, like ClickHouse, is a better fit.


CRConrad 1771 days ago

If all you have is a Hummer everything looks like an enemy vehicle.


brundolf 1771 days ago

Forgive me if this is naive, but could smaller-scale cases be served by a version that uses the MapReduce model as a way to cleanly break up operations across cores instead of machines? Or do the benefits of the model become mostly irrelevant in that case?

I'm sure it wouldn't take the form of a dedicated process; probably just a language-agnostic programming pattern


dekhn 1771 days ago

Most of the benefits go away, but yes, you can do this. MapReduce had a flag to use multiple cores for multiple workers on a machine and this was often the way to get the greatest throughput.
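
For illustration, here is a rough sketch of the single-machine version of the pattern asked about above: a word count split across cores with Python's standard `multiprocessing.Pool`. This is not how MapReduce or Hadoop implement it; the names `map_chunk`, `merge_counts`, `word_count`, and the `workers` / `chunk_size` parameters are made up for the example.

```python
from collections import Counter
from functools import reduce
from multiprocessing import Pool


def map_chunk(lines):
    """Map step: count the words in one chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge_counts(left, right):
    """Reduce step: fold one partial count into another."""
    left.update(right)
    return left


def word_count(lines, workers=4, chunk_size=1000):
    """Split the input into chunks, map them in parallel, reduce in the parent."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(processes=workers) as pool:
        partials = pool.map(map_chunk, chunks)  # one task per chunk
    return reduce(merge_counts, partials, Counter())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog"] * 3
    print(word_count(sample, workers=2, chunk_size=2).most_common(1))
    # prints [('the', 6)]
```

The map and reduce functions are plain functions with no shared state, which is the part of the model that carries over to a single machine. The fault tolerance and data locality of the distributed frameworks are not needed here, and this sketch makes no attempt at them.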
