by itronitron 2903 days ago
A lot of those >1TB data sources are very standardized and can be mapped to a schema, in which case indexing the data supports interactive queries and analytics. Hadoop seems better suited to data that needs to be processed in very different and changing ways.
2 comments

I think this is the other point that's so often missed/ignored in the "big data" discussion: there's a middle ground between everything-fits-in-memory and must-be-distributed.

The Adam Drake article alludes to it only in the last sentence by mentioning traditional relational databases as an alternative.

For workloads that are relatively I/O-heavy and CPU-light, it's very hard to beat local SSDs (or even HDDs in enough quantity) attached to a single [1] node, if the competition is distributed storage attached by Ethernet. It only takes a couple of 600MB/s SSDs to saturate a 10GbE link. A server with 48 lanes of PCIe 3.0 (roughly 47GB/s in aggregate) could take the I/O of 78 of them.

100Gb/s networks are getting close. For upwards of $1k per server (NIC and switch port), one can bring the ratio of local I/O bandwidth to network bandwidth down from 39:1 to something closer to 4:1. I'd expect this is the attractive route for anyone with CPU needs that can't be met by a 4-socket server.
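For anyone who wants to check the arithmetic, here is a rough back-of-the-envelope sketch. The 600MB/s per SSD and the 48 lanes come from the comment; the ~985MB/s per PCIe 3.0 lane is my assumption (8 GT/s with 128b/130b encoding), and Ethernet is taken at raw line rate with no protocol overhead. The 39:1 figure appears to come from 78 SSDs versus "a couple" (2) SSDs per 10GbE link; using exact bandwidths instead gives a slightly lower number.

```python
# Back-of-the-envelope check of the local-I/O vs. network ratios.
SSD_MB_S = 600          # per-SSD throughput, taken from the comment
PCIE_LANES = 48         # taken from the comment
PCIE3_LANE_MB_S = 985   # assumption: usable MB/s per PCIe 3.0 lane


def link_mb_s(gbit_per_s):
    """Raw Ethernet line rate in MB/s (assumption: no protocol overhead)."""
    return gbit_per_s * 1000 / 8


pcie_mb_s = PCIE_LANES * PCIE3_LANE_MB_S          # aggregate PCIe bandwidth
ssds_local = int(pcie_mb_s // SSD_MB_S)           # SSDs the PCIe lanes can feed
ssds_per_10gbe = link_mb_s(10) / SSD_MB_S         # SSDs needed to fill 10GbE

ratio_rounded = ssds_local / 2                    # "a couple" of SSDs per link
ratio_10g = pcie_mb_s / link_mb_s(10)             # exact-bandwidth version
ratio_100g = pcie_mb_s / link_mb_s(100)

print(f"SSDs fed by {PCIE_LANES} lanes: {ssds_local}")
print(f"SSDs to saturate 10GbE: {ssds_per_10gbe:.2f}")
print(f"local:network at 10GbE: {ratio_rounded:.0f}:1 (rounded), {ratio_10g:.1f}:1 (exact)")
print(f"local:network at 100GbE: {ratio_100g:.1f}:1")
```

Swapping in different SSD speeds or lane counts shows how sensitive the ratio is to those assumptions.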

[1] Yes, there can be more than one node, with copies for redundancy or even scalability, as has been complained about elsewhere in the sub-thread.

You're absolutely right. My estimate of "anything greater than 1TB" was probably not quite the right number. At that point, though, careful indexing is probably the more expert option. It's easy to throw data into EMR now if you're not super concerned about performance/cost, and it doesn't require you to think through how your data will need to be indexed to support your future needs.
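To make the "think through how your data will need to be indexed" point concrete, here is a minimal sketch using Python's built-in sqlite3 module. The `events` table, its columns, and the index name are all hypothetical, and SQLite is just standing in for whatever relational database you'd actually use. It shows the query planner switching from a full table scan to an index lookup once an index matching the query exists.

```python
import sqlite3

# Hypothetical schema and data, purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, ts INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    ((i % 1000, i, "x") for i in range(100_000)),
)

QUERY = "SELECT COUNT(*) FROM events WHERE user_id = ?"


def plan(connection, sql, params):
    """Return the planner's description of how it would run the query."""
    rows = connection.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column is the detail text


plan_before = plan(conn, QUERY, (42,))   # no index yet: full scan
conn.execute("CREATE INDEX idx_events_user ON events (user_id)")
plan_after = plan(conn, QUERY, (42,))    # planner can now use the index

count = conn.execute(QUERY, (42,)).fetchone()[0]
print("before:", plan_before)
print("after: ", plan_after)
print("rows for user 42:", count)
```

The catch is the one described above: the index only helps queries that filter on `user_id`, so you have to know your access patterns up front.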