by __vb__ 2232 days ago
Databricks has other optimizations on top of the open source version of Spark. Are you maintaining your own version of Spark or using vanilla Spark?

One thing I constantly deal with is how to optimize Spark: how to use Ganglia and the Spark UI to dig into what is causing data skew and slowness while jobs are running. Is this something that you do better than Databricks?
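
To make the data skew part concrete, here is a minimal sketch of the kind of check I mean, assuming the per-partition row counts have already been collected (in PySpark, for example, by grouping on `spark_partition_id()` and counting). The function name `skew_report` and the 3x-median threshold are hypothetical choices for illustration, not anything Spark or Databricks defines.

```python
from statistics import median


def skew_report(partition_sizes, threshold=3.0):
    """Summarize partition skew from a list of per-partition row counts.

    A partition is flagged as skewed when its size exceeds
    `threshold` times the median partition size (an assumed heuristic).
    """
    if not partition_sizes:
        return {"median": 0, "max": 0, "ratio": 0.0, "skewed": []}

    med = median(partition_sizes)
    largest = max(partition_sizes)

    if med > 0:
        ratio = largest / med
        skewed = [i for i, n in enumerate(partition_sizes) if n > threshold * med]
    else:
        # Median of zero: most partitions are empty, so any non-empty one is an outlier.
        ratio = float("inf") if largest > 0 else 0.0
        skewed = [i for i, n in enumerate(partition_sizes) if n > 0]

    return {"median": med, "max": largest, "ratio": ratio, "skewed": skewed}


# Example input: four balanced partitions and one hot partition.
report = skew_report([100, 110, 95, 105, 2000])
print(report["skewed"], round(report["ratio"], 1))
```

A high max-to-median ratio usually shows up in the Spark UI as one task in a stage running far longer than the rest, which is the pattern I end up hunting for by hand.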

1 comment

Spark versions: Only vanilla (open source) Spark. But we offer a set of pre-packaged Docker images with useful libraries (e.g. for ML or for efficient data access) for each major Spark version. You can use them directly or build your own Docker image on top of them.

Optimization/Monitoring: This topic is very important to us, thanks for bringing it up. We do tune configurations automatically, but developers still need to understand the performance of their app to write better code. We're working on a Spark UI + Ganglia improvement (well, replacement really), which we could potentially open source.

Would you mind emailing me (jy@datamechanics.co) or even scheduling a call with me (https://calendly.com/b/datamechanics/avk7bhxq) so I can show you what we have in mind and get your feedback? Anyone else interested is welcome to do the same.