by ojnabieoot 2232 days ago
Speaking as someone who might be in your target audience: my experience with Databricks (back in 2017/2018, without Kubernetes) is that their product is just as unreliable and frustrating as deploying a Spark cluster manually, but also more expensive and more time-consuming. It was so bad that I was wondering if the entire company was a scam - which isn't true, of course. I suspect a big part of our problem was a shuffle-heavy workload hitting a relatively new product. But it left a really bad taste in my mouth about the entire business model of "Spark as a Service."

My impulse reaction to your sales pitch is "their product probably doesn't work very well and is way too expensive." I know that's unfair, but this entire idea of "our platform automates away the tedium of Spark clusters" just strikes me as a bag of magic beans.

What would help a lot with drawing cynical, bitter people like me: case studies on your website. I know that's a lot to ask for a young startup. But actual details about either money or developer time saved with Data Mechanics - specific pains your customers were having and how Data Mechanics addressed them, or specific analyses your customers were able to do now that they're spending less time managing Spark. Running a big Spark job in the cloud is a huge financial risk, and many Spark users are much more concerned about this than the headaches involved with management - and again, my last experience with Databricks resulted in more cost and more headaches. I do not think I am alone here.

I am wondering if you're considering selling your Spark telemetry and parameter-tuning software, or offering it as a service. Speaking personally, I would be much more open to using Data Mechanics's tools on my own Spark cluster than to outsourcing the actual management. At my organization, in addition to AWS, we also have a local Hadoop cluster with Spark installed; commercial software that gives better insight into its performance could be very useful.


Shuffling in Spark works well for small datasets, but is not reliable for large datasets because fault tolerance in Spark is incomplete. For example, check this Jira:

https://issues.apache.org/jira/browse/SPARK-20178

So, if your problem was mainly due to a shuffle-heavy workload, then I guess no managed Spark service would be able to alleviate or eliminate it by automatic parameter tuning. In other words, your pain might be due to a fundamental problem in Spark itself.

IMO, Spark is great, but its speed is no longer its key strength. For example, Hive is much faster than SparkSQL these days.

It's worse than that. Shuffle for Spark on Kubernetes is fundamentally broken and hasn't yet been fixed. The problem is that Docker containers cannot (for security reasons) share the same host-level disks. There is no external shuffle service, and disk caching is container-local (not using kernel-level disk I/O buffering), which kills performance. Google's proposed solution below is to use NFS to store shuffle files, which is not going to be performant. Stick with YARN for Spark and only switch when shuffle is fixed for k8s. Databricks are in no rush to get shuffle fixed for k8s.

References: https://youtu.be/GbpMOaSlMJ4?t=1617 https://t.co/KWDNHjudfY?amp=1 https://issues.apache.org/jira/browse/SPARK-25299

I agree that Spark on Kubernetes will have a hard time fixing the problem of shuffling. If they choose to use local disks for per-node shuffle service, a performance issue arises because disk-caching is container-local. If they choose to use NFS to store shuffle files, a different kind of performance issue arises because of not using local disks for storing intermediate files. All these issues will arise without properly implementing fault tolerance in Spark.

We are currently trying to fix the first problem in a different context (not Spark), where worker containers store intermediate shuffle files on local disks mounted as hostPath volumes. The performance penalty is about 50% compared with running everything natively. Besides, some containers occasionally get almost stuck for a long time. I believe that the Spark community will encounter the same problem in the future if they choose to use local disks for storing intermediate files.

Glad our post sparked some pretty deep discussions on the future of spark-on-k8s! The OS community is working on several projects to help with this problem. You've mentioned NFS (by Google), but there's also the possibility of using object storage. Mappers would first write to local disks, and then the shuffle data would be moved asynchronously to the cloud.

Sources: - end of presentation https://www.slideshare.net/databricks/reliable-performance-a... - https://issues.apache.org/jira/browse/SPARK-25299
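
To make the "write locally, then move asynchronously" idea above concrete, here is a minimal Python sketch. It is not Spark code and not the design from SPARK-25299; the function names, the file naming scheme, and the use of a plain directory as a stand-in for an object store are all assumptions made for illustration.

```python
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def write_shuffle_block(local_dir: Path, map_id: int, data: bytes) -> Path:
    """Mapper side: write the shuffle output to local disk first (the fast path)."""
    path = local_dir / f"shuffle_{map_id}.data"  # hypothetical naming scheme
    path.write_bytes(data)
    return path


def upload_block(local_path: Path, remote_dir: Path) -> Path:
    """Background copy to the 'remote' store; a directory stands in for a bucket."""
    target = remote_dir / local_path.name
    shutil.copyfile(local_path, target)
    return target


def run_mappers(num_mappers: int, local_dir: Path, remote_dir: Path) -> List[Path]:
    """Write every block locally and hand each one to a background uploader."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        futures = []
        for map_id in range(num_mappers):
            block = write_shuffle_block(
                local_dir, map_id, f"rows-from-{map_id}".encode()
            )
            # The mapper loop continues without waiting for the upload.
            futures.append(pool.submit(upload_block, block, remote_dir))
        # Collect results so any upload error surfaces here.
        return [f.result() for f in futures]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as local, tempfile.TemporaryDirectory() as remote:
        uploaded = run_mappers(3, Path(local), Path(remote))
        print(sorted(p.name for p in uploaded))
```

The point of the split is that a reducer could fall back to the remote copy if the node holding the local file disappears, while the mapper never waits on the slower store.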

I completely moved away from Spark to Snowflake for this reason. Its failure modes seem to be far more predictable, and you become significantly more productive with it even though it's all pure SQL.

Thanks for the detailed feedback. Spark can sometimes be frustrating. Automated tuning has a major impact, but it is no silver bullet: sometimes a stability or performance problem lies in the code or the input data (partitioning).

That's why we're working on a new monitoring solution (think Spark UI + node metrics) to give Spark developers much-needed high-level feedback on the stability and performance of their apps. We'd like to make this work on top of other data platforms (at least the monitoring part; the automated tuning would be much harder).

Case studies: Thanks, we're working on them. Check our Spark Summit 2019 talk (How to automate performance tuning for Apache Spark) for the analysis of the impact at one of our customers.

Over the last year there have been a significant number of low-level changes in the proprietary versions of Spark (aka EMR and Databricks) designed to address reliability and stability. Out of curiosity, what exceptions did you run into?