by ojnabieoot
2232 days ago
Speaking as someone who might be in your target audience: my experience with Databricks (back in 2017/2018, without Kubernetes) is that their product is just as unreliable and frustrating as deploying a Spark cluster manually, but also more expensive and more time-consuming. It was so bad that I was wondering if the entire company was a scam - which isn't true, of course. I suspect a big part of our problem was a shuffle-heavy workload hitting a relatively new product. But it left a really bad taste in my mouth about the entire business model of "Spark as a Service." My impulse reaction to your sales pitch is "their product probably doesn't work very well and is way too expensive." I know that's unfair, but this entire idea of "our platform automates away the tedium of Spark clusters" just strikes me as a bag of magic beans.

What would help a lot with drawing cynical, bitter people like me: case studies on your website. I know that's a lot to ask for a young startup. But actual details about either money or developer time saved with Data Mechanics - specific pains your customers were having and how Data Mechanics addressed them, or specific analyses your customers were able to do now that they're spending less time managing Spark. Running a big Spark job in the cloud is a huge financial risk, and many Spark users are much more concerned about this than the headaches involved with management - and again, my last experience with Databricks resulted in more cost and more headaches. I do not think I am alone here.

I am wondering if you're considering selling your Spark telemetry/parameter tuning/etc software, or offering it as a service, etc. Speaking personally, I would be much more open to using Data Mechanics's tools on my own Spark cluster rather than outsource the actual management. At my organization, in addition to AWS, we also have a local Hadoop cluster with Spark installed; commercial software that gives better insight into its performance could be very useful.
|
https://issues.apache.org/jira/browse/SPARK-20178
So, if your problem was mainly due to a shuffle-heavy workload, then I guess no managed Spark service would be able to alleviate or eliminate it through automatic parameter tuning. In other words, your pain might be due to a fundamental problem in Spark itself.
IMO, Spark is great, but its speed is no longer its key strength. For example, Hive is much faster than SparkSQL these days.