by potatoyogurt 2898 days ago
I agree in general. But I think you underestimate how large the set of use cases is where people are processing > 1TB of data. This includes quite a bit of the adtech industry, for instance, even many startups. It also includes data warehouses for other industries, such as in health tech. Of course, these people generally are experts, since it is their core business, so they know well what tools they need. I agree that for some analytics department in a random company whose core business isn't processing data, Hadoop is more likely to be a resume item than something that's really needed.

A lot of those > 1TB data sources are very standardized and can be mapped to a schema, in which case indexing the data supports interactive queries and analytics. Hadoop seems well suited for data that needs to be processed in very different and changing ways.
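To make the indexing point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `events` table, its columns, and the index name are hypothetical stand-ins for a schema-mapped data source, not anything from the thread. It only shows that once data fits a schema, an index lets a selective query skip a full scan.

```python
import sqlite3

# Hypothetical ad-event schema; table, column, and index names are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (campaign_id INTEGER, ts INTEGER, cost REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [(i % 100, i, 0.01) for i in range(10_000)],
)

# Index the column that interactive queries filter on.
conn.execute("CREATE INDEX idx_events_campaign ON events (campaign_id)")

query = "SELECT COUNT(*), SUM(cost) FROM events WHERE campaign_id = ?"

# Ask SQLite how it plans to run the query; the last column of each
# row is a human-readable description of the chosen access path.
plan = conn.execute("EXPLAIN QUERY PLAN " + query, (42,)).fetchall()
plan_text = " ".join(row[-1] for row in plan)

count, total = conn.execute(query, (42,)).fetchone()
print(plan_text)
print(count, total)
```

The same idea applies to a larger relational or columnar store; SQLite is used here only because it ships with Python.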
I think this is the other point that's so often missed/ignored in the "big data" discussion: there's a middle ground between everything-fits-in-memory and must-be-distributed.

The Adam Drake article alludes to it only in the last sentence by mentioning traditional relational databases as an alternative.

For workloads that are relatively I/O-heavy and CPU-light, it's very hard to beat local SSDs (or even HDDs in enough quantity) attached to a single [1] node, if the competition is distributed storage attached by ethernet. It only takes a couple of 600MB/s SSDs to saturate a 10GbE link. A server with 48 lanes of PCIe 3 could take the I/O of 78 of them.

100Gb/s networks are getting close. For upwards of $1k per server (NIC and switch port) one can bring the ratio of local PCIe bandwidth to network bandwidth closer to 4:1 from 39:1. I'd expect this is the attractive route for anyone with CPU needs that can't be met by a 4-socket server.
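The numbers above can be roughly reproduced with a back-of-the-envelope calculation. This sketch assumes about 985 MB/s of usable bandwidth per PCIe 3 lane (my assumption, not stated in the comment) and ignores protocol overhead on the network side. With unrounded values the 10GbE ratio comes out near 38:1; the 39:1 figure follows from rounding to 78 SSDs on PCIe and 2 SSDs per 10GbE link.

```python
SSD_MBPS = 600            # per-SSD throughput quoted in the comment
PCIE3_LANE_MBPS = 985     # assumed usable bandwidth per PCIe 3 lane
PCIE_LANES = 48


def link_mbps(gigabits_per_second):
    """Convert a network link speed in Gb/s to MB/s (no overhead modeled)."""
    return gigabits_per_second * 1000 / 8


pcie_mbps = PCIE_LANES * PCIE3_LANE_MBPS

# How many 600 MB/s SSDs each pipe can absorb.
ssds_on_pcie = pcie_mbps / SSD_MBPS
ssds_on_10gbe = link_mbps(10) / SSD_MBPS
ssds_on_100gbe = link_mbps(100) / SSD_MBPS

# Ratio of local PCIe bandwidth to network bandwidth.
ratio_10gbe = pcie_mbps / link_mbps(10)
ratio_100gbe = pcie_mbps / link_mbps(100)

print(f"SSDs to fill 48 PCIe 3 lanes: {ssds_on_pcie:.1f}")
print(f"SSDs to fill 10GbE:           {ssds_on_10gbe:.1f}")
print(f"SSDs to fill 100GbE:          {ssds_on_100gbe:.1f}")
print(f"PCIe : 10GbE  ratio:          {ratio_10gbe:.1f}:1")
print(f"PCIe : 100GbE ratio:          {ratio_100gbe:.1f}:1")
```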

[1] Yes, there can be more than one node holding copies, whether for redundancy (as has been complained about elsewhere in the sub-thread) or even for scalability.

You're absolutely right. Probably my estimate of "anything greater than 1TB" was not quite the right number. At that point, though, careful indexing is probably the more expert option. It's easy to throw data into EMR now if you're not super concerned about performance/cost, and it doesn't require you to think through how your data will need to be indexed to support your future needs.
In my experience, people generally go the other way: they say “we need Spark/Hadoop/Cassandra because we have Big Data” when they have a 30GB dataset that is best handled on a beefy EC2 instance with boring tools.