Can confirm that running Spark at scale is difficult. I'm not even necessarily talking about scale of data or scale of performance, but organizational scale. Getting dozens or hundreds of engineers aligned around best practices, tooling and local development for Spark is both challenging and extremely rewarding. When you have everyone buy into Spark as not just an execution environment but a programming paradigm, it really unlocks some cool potential.

If anyone cares, this is the best way I've found to get Spark users riding on rails:

* Use a monorepo to "namespace" different projects/teams/whatever. Each namespace has its own build.sbt for Scala jobs and Conda/Pip requirements file for PySpark. This gives you package isolation so that different projects can bump requirements at their own pace. This is crucial in larger organizations where you might have more siloed development or more legacy applications.

* Build each project in the monorepo into a separate Docker image and tag it accordingly with some combination of the branch and namespace.

* Deploy applications onto Kubernetes by invoking the SparkOperator (https://github.com/GoogleCloudPlatform/spark-on-k8s-operator). This abstracts away a lot of the hassle of driver/executor configuration and gives you nice out-of-the-box functionality for scraping Spark metrics.

* For local development, use some type of CLI or Makefile to build/run the image locally. This is where the implementation diverges somewhat from using SparkOperator (unless you want to tell your employees that everyone needs to run Kubernetes on their local machine, which we thought would create too much friction).

* For orchestration, write a custom operator for Airflow that submits a SparkOperator resource to the Kubernetes cluster of your choosing. The Airflow operator should supervise the application state, since the SparkOperator doesn't quite do that well enough for you. This is something I wish we had the opportunity to open source.

* Where it gets tricky is building Spark applications locally and running them remotely. Say you built a job locally and tested it on a small subset of your data. Now you want to see what happens when you run across a full dataset, requiring more than 16 GB of memory (or whatever the developer has on their laptop). You need some way to build your image locally but schedule it remotely. This could be done via the same CLI or Makefile, but you end up with a lot of images and it gets pretty costly. I'm sure we would have figured it out eventually if we didn't all get laid off last month :P

* BONUS: Use Iceberg (https://iceberg.apache.org/) or Delta (https://delta.io/). These are table formats that work with distributed file storage like HDFS or S3 to partition and query data using the Spark DataFrame API. You get time travel, schema evolution and a bunch of other sweet features out of the box. They are an evolution of Hadoop-era partitioned file formats and are an absolute must for organizations dealing with lots of data & ML infrastructure.

This post took up more time than I had wanted, but it actually feels good to write it down before I forget. I hope it is useful for someone building Spark infrastructure. I'm sure others have a completely different approach, which I'd be curious to hear!

As someone whose full-time job was basically just to orchestrate Spark application development, I can say for certain that products like this are needed in order for the ecosystem to thrive, and I would probably have given you my business had the circumstances been right. Good luck to you and your team.
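For anyone who wants something concrete, here is a rough Python sketch of the "tag by branch and namespace, then hand it to the SparkOperator" steps above. It is not the setup described in the comment, just one way it could look. The registry URL, the project and branch names, the resource sizes, the Spark version and the `spark` service account are all made-up placeholders; only the `SparkApplication` field names come from the operator's v1beta2 resource.

```python
import re


def image_tag(project: str, branch: str) -> str:
    """Combine a monorepo project ("namespace") and a git branch into a Docker tag."""
    raw = f"{project}-{branch}"
    # Docker tags only allow letters, digits, '_', '.' and '-',
    # so characters such as '/' in branch names are replaced.
    cleaned = re.sub(r"[^A-Za-z0-9_.-]", "-", raw)
    # A tag may not start with '.' or '-' and is capped at 128 characters.
    return cleaned.lstrip(".-")[:128]


def spark_application(name, project, branch, main_file,
                      registry="registry.example.com/spark",  # hypothetical registry
                      k8s_namespace="spark-jobs",             # hypothetical namespace
                      executors=2):
    """Build a SparkApplication manifest (as a dict) for the SparkOperator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": k8s_namespace},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            # One image per project, tagged with project + branch.
            "image": f"{registry}/{project}:{image_tag(project, branch)}",
            "mainApplicationFile": main_file,
            "sparkVersion": "3.1.1",  # placeholder version
            # Placeholder sizes; tune per job.
            "driver": {"cores": 1, "memory": "2g", "serviceAccount": "spark"},
            "executor": {"instances": executors, "cores": 1, "memory": "4g"},
        },
    }


manifest = spark_application(
    name="ads-daily-rollup",
    project="ads",
    branch="feature/new-model",
    main_file="local:///opt/app/jobs/daily_rollup.py",
)
```

The resulting dict can be dumped to YAML for `kubectl apply`, or submitted as a custom resource from a CLI or an Airflow operator. Keeping the manifest builder as a plain function means the local CLI and the orchestrator can share it.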
Sorry to hear about the layoffs. I'd like to follow up with you to get your feedback on specific roadmap items we have in mind. Would you email us at founders@datamechanics.co to schedule a call, or at least keep in touch for when we have an interesting feature/mockup to show you? Thanks and good luck as well!