by Barraketh 2429 days ago
I know that Spark has had a lot of work put into it, but my personal experience with it has been pretty negative. I've spent a lot of time at my job trying to tune it to our workflows (extremely deep queries), with only moderate success. I've just POC'd a custom SQL execution engine that was 200x faster than Spark for the same workflows. Now, our requirements are pretty non-standard, but I find it pretty easy to believe these benchmarks.
2 comments

McSherry et al.'s paper "Scalability! But at what COST?" is worth reading. A single-threaded, single-core implementation typically outperforms Spark.

The best rule of thumb I'm aware of is: unless you can't fit your computation on a single machine or your jobs are likely to fail before completing from the size and length involved, you are generally better off without Spark or similar systems. And if sampling can get you back onto a single machine, then you're really better off.
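For anyone who hasn't read the paper: COST stands for "Configuration that Outperforms a Single Thread", i.e. the hardware configuration a system needs before it beats a competent single-threaded implementation. A minimal sketch of how you might compute it from your own measurements is below. The function name and all the numbers are made up for illustration and are not taken from the paper.

```python
from typing import Dict, Optional


def cost(single_thread_secs: float, scaling: Dict[int, float]) -> Optional[int]:
    """Return the smallest core count at which the distributed system
    beats the single-threaded baseline, or None if it never does
    (which the paper calls unbounded COST)."""
    for cores in sorted(scaling):
        if scaling[cores] < single_thread_secs:
            return cores
    return None


# Hypothetical measurements: cores -> wall-clock seconds for the same job.
measurements = {1: 900.0, 8: 400.0, 32: 250.0, 128: 110.0}

# Hypothetical runtime of a hand-written single-threaded version.
baseline_secs = 300.0

print(cost(baseline_secs, measurements))  # 32
```

With these invented numbers the system needs 32 cores just to match what one thread already does, which is the kind of result the rule of thumb above is warning about.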

In my experience too, distributed code introduces a lot of redundancy, and it takes a lot of data before it beats the performance of a single-threaded, single-machine implementation. Check out McSherry's Timely Dataflow, it is truly an amazing piece of work.
I expected there to be overhead from distributing the computation, but I was surprised by the magnitude of the speedups available.
That is my opinion too. In non-standard workflows, handcrafted code will most likely beat generic frameworks (though not in every case). I have conflicting thoughts about this. Industries move very fast nowadays, and they generally can't afford to build everything from scratch for each of their use cases, so they tend to pick up generic frameworks. But I have seen many managers pick the wrong tools for the job and vastly overestimate their future needs. Everyone thinks that they are going to process petabytes of data, and they decide to use these generic distributed frameworks from the beginning to avoid scaling problems later. That scale rarely arrives. Most of the time, they end up spending money on cloud resources, because making something distributed comes with a lot of redundancy to provide fault tolerance, and yet it is not as performant as a single machine for data up to a few TB. Even here, if you take that Parquet example, my hand-coded Rust code beats the Rust RDD version by 4x. I guess we can't change this attitude, so it is better to aim at improving these libraries.
I completely agree that we need better generic libraries. I was mostly commenting that I really believe that there are huge wins that can be achieved in the "generic distributed execution engine" space, and that people shouldn't be intimidated by the work that has already gone into Spark.