by madhadron 2429 days ago
McSherry et al.'s paper "Scalability! But at what COST?" is worth reading. In its graph-processing benchmarks, a single-threaded, single-core implementation typically outperforms Spark and similar systems running on many cores.

The best rule of thumb I'm aware of is: unless your computation can't fit on a single machine, or your jobs are so large and long-running that they are likely to fail before completing, you are generally better off without Spark or similar systems. And if sampling can get you back onto a single machine, you're better off still.

2 comments

I have also observed that distributed code introduces a lot of redundancy, and that it takes a lot of data before it beats a single-threaded, single-machine implementation. Check out McSherry's Timely Dataflow; it is a truly amazing piece of work.
I expected some overhead from distributing the computation, but I was surprised by the magnitude of the speedups available.