Hacker News
by superyesh 1912 days ago
>Arc is an opinionated framework for defining predictable, repeatable and manageable data transformation pipelines;

I am confused by the title `Arc, an open-source Databricks alternative`. One of the main benefits of Databricks is managed Spark. This isn't replacing Databricks as such; it is probably giving an alternative to one of the features in Databricks.

2 comments

Yeah, agreed. I was a Databricks skeptic when I first came across it, but its value goes a LONG way beyond just managing Spark.

For example, we found that Databricks' Spark (or their 'Delta engine' or whatever it's called) had 50-60% better performance on our workloads than 'core' Spark. I guess that's not surprising when a large proportion of Spark contributors work for you and can performance-tune it! Not to mention things like MLflow and all their data engineering stuff.

This is a cool project, and I admire its ambition, but saying it's a real 'alternative' to Databricks is a bit disingenuous.

Databricks writes some good tools, but it can get pretty expensive. Kubeflow has been evolving well and is gaining lots of traction. It's pretty neat from my experience so far.

We provide multiple Docker images (https://github.com/orgs/tripl-ai/packages) that make Spark deployment easy:

- arc-jupyter: lets you develop on your local machine (including offline), or you can integrate it with a JupyterHub deployment on Kubernetes (https://zero-to-jupyterhub.readthedocs.io/en/stable/index.ht...). We have built JupyterHub on GCP Kubernetes (GKE) with full user-level auth via GCP IAM. If anyone is interested I can publish a secrets-removed version of our script.

- arc: the execution-only Docker image (so it is smaller than arc-jupyter). We have this orchestrated on Kubernetes too, and now that Spark officially supports Kubernetes deployment it is actually really easy to create and destroy clusters on demand.
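
For illustration only, here is a minimal sketch of how a Spark job can be submitted to a Kubernetes cluster in cluster mode. It only assembles a `spark-submit` command line using Spark's documented Kubernetes options. The helper name `build_spark_submit`, the API server URL, the image name, and the jar path are all placeholders assumed for the example, not values from the Arc project or the script mentioned above.

```python
import shlex


def build_spark_submit(api_server, image, app_resource, executors=2,
                       name="example-job", main_class=None):
    """Assemble a spark-submit argument list for Kubernetes cluster mode.

    All argument values are supplied by the caller; nothing here is
    specific to Arc.
    """
    cmd = [
        "spark-submit",
        "--master", f"k8s://{api_server}",  # Kubernetes API server URL
        "--deploy-mode", "cluster",         # driver runs in a pod too
        "--name", name,
        "--conf", f"spark.kubernetes.container.image={image}",
        "--conf", f"spark.executor.instances={executors}",
    ]
    if main_class is not None:
        cmd += ["--class", main_class]
    # local:// means the file is already present inside the container image
    cmd.append(app_resource)
    return cmd


# Placeholder values, assumed for illustration only.
cmd = build_spark_submit(
    api_server="https://kubernetes.example.com:6443",
    image="example/spark-app:latest",
    app_resource="local:///opt/spark/jars/app.jar",
)
print(shlex.join(cmd))
```

Because the driver and executors run as pods that go away when the job ends, each submission effectively creates a short-lived cluster, which is roughly what "create and destroy clusters on demand" amounts to in this setup.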