| HN Mirror


by lifeisstillgood 2446 days ago

>>> One of our main reasons for choosing Apache Spark had been its ability to handle very large datasets (larger than what you can fit into memory on a single node) and its ability to distribute computations over a whole cluster of machines ... We cannot fit all of our datasets in memory on one node, but that is also not necessary, since we can trivially shard datasets of different projects over different servers, because they are all independent of one another.

So this seems to be the massive takeaway: if you need to operate on a whole dataset that is larger than one node's memory capacity, then you have to go distributed. Otherwise it still seems like overhead that is barely worth the effort.
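For what it's worth, the "trivially shard" option from the quote can be sketched in a few lines. This is only my guess at the general idea, not what the article's authors actually run: the function names, the project IDs and the server names are all made up, and hashing the project ID is just one simple way to pick a server.

```python
import hashlib

def assign_server(project_id: str, servers: list[str]) -> str:
    """Pick one server for a project by hashing its ID (hypothetical scheme)."""
    digest = hashlib.sha256(project_id.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer, then wrap around the server list.
    index = int.from_bytes(digest[:8], "big") % len(servers)
    return servers[index]

def shard_projects(project_ids: list[str], servers: list[str]) -> dict[str, list[str]]:
    """Group independent projects by the server that will hold each whole dataset."""
    shards: dict[str, list[str]] = {server: [] for server in servers}
    for project_id in project_ids:
        shards[assign_server(project_id, servers)].append(project_id)
    return shards

# Made-up names, purely for illustration.
servers = ["node-a", "node-b", "node-c"]
projects = [f"project-{n}" for n in range(10)]
shards = shard_projects(projects, servers)
```

Each project lives entirely on one node, so no computation ever has to span machines. That only works because the projects are independent of one another, which is exactly the condition the quote relies on.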

So Google: the dataset is all web pages on the internet. Yes, that's too large, so go distributed.

Tesco / Walmart: the dataset might be all the sales for a year. Probably too large. But could you make do with sales per week? Per day?
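To make the "per week" idea concrete, here is a rough sketch of rolling sales up into weekly totals while streaming through the records one at a time, so the full year never has to sit in memory. The record shape (an ISO date string plus an amount) and the names `weekly_totals` and `sales_stream` are my own assumptions, not anything from a real retailer's system.

```python
from collections import defaultdict
from datetime import date

def weekly_totals(sales):
    """Sum amounts per ISO (year, week) from an iterable of (iso_date, amount) pairs."""
    totals = defaultdict(float)
    for day, amount in sales:
        iso = date.fromisoformat(day).isocalendar()
        # iso[0] is the ISO year, iso[1] is the ISO week number.
        totals[(iso[0], iso[1])] += amount
    return dict(totals)

def sales_stream():
    """Stand-in for reading rows lazily from a file or database cursor (made-up data)."""
    yield ("2019-01-07", 10.0)
    yield ("2019-01-08", 5.0)
    yield ("2019-01-14", 2.5)

totals = weekly_totals(sales_stream())
```

Memory use here grows with the number of weeks, not the number of transactions, which is the whole point: if the business question only needs weekly figures, one node may be plenty.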

Having the raw data of all your transactions etc. lying around waiting for your spiffo business query sounds good but ... is it?

I would be interested in hearing folks' cut-off points for going full Big Data vs "we don't really need this"