| I like the analysis, basically it says "hey you don't have big data" :-) but that requires a bit more explanation. The only advantage of clustered systems like Spark, Hadoop, and others is aggregate bandwidth to disk and memory. We know that because Amdahl's law tells us that parallelizing something invariably adds overhead. So from a systems perspective that overhead has to be "paid for" by some other improvement, and we'll see that it is access to data. If your task is to process a 2TB data set, on a single machine using a 6GBS SATA channel and 2TB of FLASH SSDs you can read in that dataset into memory in 3333 seconds (at 600MB/sec which is very optimistic for our system), process it, and lets say you write out a 200GB reduced data set for another 333 seconds. so, conveniently, an hour of I/O time. Now you take that same dataset and distribute it evenly across a thousand nodes. Each one then has 2GB of the data on it. Each node can read in their portion of the data set in 3 seconds, process it and write out their reduction in .3 seconds. You have "paid" for the overhead of parallelization by trading an I/O cost of an hour for an I/O cost of about 4 seconds. That is when parallel data reduction architectures are better for a problem than a single threaded architecture. And that "betterness" is purely artificial in the sense that you would be better off with a single system that had 1,000 times the I/O bandwidth (cough mainframe cough) than 1,000 systems with the more limited bandwidth. However a 1,000 machines with one SSD it still cheaper buy than one mainframe of similar capability. So if, and its a big if, your algorithm can be expressed as a data map / reduce problem, and your data is large enough to push the cost of getting it into memory to have a look at significantly beyond cost of executing the program, then you can benefit positively by running it on a cluster rather than running it on a local machine. |
Still, I think the general rule applies that if you can buy a server that will fit your dataset into RAM, probably you don't need something like Hadoop.