by potatoyogurt 2897 days ago
Best fit is something you can argue a lot about. There are a lot of data processing tools out there now, many of which have come out after Hadoop. But if the comparison is against some process running on a single machine, then the use case is not narrow. It includes basically anything where you're processing more than 1TB of data in non-trivial ways (i.e. not just a map operation) and are okay with batch processing.

The issue with that is that for most organizations, their key business data (and all its recorded history) fits in the RAM of a sufficiently beefy workstation. They want to call it Big Data to stroke their egos, and properly acquiring, cleaning and integrating that data can take a LOT of effort, so the data can be quite expensive and worthy of any glorious label they can think of. But my experience is that processing more than 1TB of meaningful data actually is a narrow use case, one that matters in two specific categories: the (relatively few) very large multinational companies, and processing of raw video/audio/image data. The majority of people working on data analysis end up with business needs that can be satisfied by relatively simple methods on relatively small datasets.
I agree in general. But I think you underestimate how large the set of use cases is where people are processing > 1TB of data. This includes quite a bit of the adtech industry, for instance, even many startups. It also includes data warehouses for other industries, such as in health tech. Of course, these people generally are experts, since it is their core business, so they know well what tools they need. I agree that for some analytics department in a random company whose core business isn't processing data, Hadoop is more likely to be a resume item than something that's really needed.
A lot of those > 1TB data sources are very standardized and can be mapped to a schema, in which case indexing the data supports interactive queries and analytics. Hadoop seems well suited for data that needs to be processed in very different and changing ways.
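To make the indexing point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `events` table, its columns, and the index name are all hypothetical, and a toy in-memory table only stands in for a real warehouse. It shows that once data fits a schema, an index lets a selective query skip the full scan.

```python
import sqlite3

# Hypothetical schema: one row per event, keyed by user.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events (user_id INTEGER, ts INTEGER, payload_bytes INTEGER)"
)
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    ((i % 1000, i, i % 7) for i in range(100_000)),
)

# Index the column that interactive queries filter on.
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")

query = "SELECT COUNT(*), SUM(payload_bytes) FROM events WHERE user_id = ?"

# Ask SQLite how it would run the query; the last column of each
# plan row is a human-readable description of that step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
plan_text = " ".join(row[3] for row in plan)

row_count, total_bytes = conn.execute(query, (42,)).fetchone()
print(plan_text)
print(row_count, total_bytes)
```

The printed plan should mention `idx_events_user`, meaning the lookup goes through the index instead of scanning every row.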
I think this is the other point that's so often missed/ignored in the "big data" discussion: there's a middle ground between everything-fits-in-memory and must-be-distributed.

The Adam Drake article alludes to it only in the last sentence by mentioning traditional relational databases as an alternative.

For workloads that are relatively I/O-heavy and CPU-light, it's very hard to beat local SSDs (or even HDDs in enough quantity) attached to a single [1] node, if the competition is distributed storage attached by ethernet. It only takes a couple of 600MB/s SSDs to saturate a 10GE link. A server with 48 lanes of PCIe 3 could take the I/O of 78 of those SSDs.

100Gb/s networks are getting close. For upwards of $1k per server (NIC and switch port), one can bring the ratio of local SSD bandwidth to network bandwidth down from 39:1 to roughly 4:1. I'd expect this is the attractive route for anyone with CPU needs that can't be met by a 4-socket server.
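For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. The ~985 MB/s usable per PCIe 3.0 lane is my assumption about the figure behind the "78" above, network protocol overhead is ignored, and the variable names are made up for illustration.

```python
SSD_MB_S = 600          # per-SSD throughput quoted in the comment
PCIE3_LANE_MB_S = 985   # assumption: approx. usable bandwidth per PCIe 3.0 lane
PCIE_LANES = 48         # lane count quoted in the comment


def link_mb_s(gbit_per_s: float) -> float:
    """Convert a link speed in Gb/s to MB/s, ignoring protocol overhead."""
    return gbit_per_s * 1000 / 8


def local_to_network_ratio(num_ssds: int, gbit_per_s: float) -> float:
    """Aggregate local SSD bandwidth divided by one network link's bandwidth."""
    return num_ssds * SSD_MB_S / link_mb_s(gbit_per_s)


# How many 600 MB/s SSDs the PCIe lanes can feed at full speed.
ssds_per_server = (PCIE_LANES * PCIE3_LANE_MB_S) // SSD_MB_S

# How many SSDs it takes to fill a 10GE link ("a couple").
ssds_per_10ge = link_mb_s(10) / SSD_MB_S

# The 39:1 figure comes from rounding "a couple" down to exactly 2 SSDs.
ratio_10ge_rounded = ssds_per_server / round(ssds_per_10ge)
ratio_10ge_exact = local_to_network_ratio(ssds_per_server, 10)
ratio_100ge = local_to_network_ratio(ssds_per_server, 100)

print(f"SSDs per server:        {ssds_per_server}")
print(f"SSDs to saturate 10GE:  {ssds_per_10ge:.2f}")
print(f"10GE ratio (rounded):   {ratio_10ge_rounded:.0f}:1")
print(f"10GE ratio (exact):     {ratio_10ge_exact:.1f}:1")
print(f"100GE ratio:            {ratio_100ge:.1f}:1")
```

Without the rounding, the 10GE ratio comes out a little under 39:1, but the order of magnitude is the point.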

[1] Yes, there can be more than one node holding copies, for redundancy or even scalability, as has been complained about elsewhere in the sub-thread.

You're absolutely right. Probably my estimate of "anything greater than 1TB" was not quite the right number. At that point, though, careful indexing is probably the more expert option. It's easy to throw data into EMR now if you're not super concerned about performance/cost, and it doesn't require you to think through how your data will need to be indexed to support your future needs.
In my experience, people generally go the other way: they say “we need Spark/Hadoop/Cassandra because we have Big Data” when they have a 30GB dataset that is best handled on a beefy EC2 instance with boring tools.