
This is actually the part I most disagree with. The overhead of connectors to Spark, particularly anything that goes through py4j, is far too limiting unless the data workload is large enough to amortize that overhead. For small-scale prototypes it’s a disaster, and separately there are concerns about marshaling data types through the JVM when you may have a Python-only data model.
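To make the amortization point concrete, here is a rough back-of-the-envelope sketch. It assumes a simplified cost model (a fixed bridge/startup cost plus a per-row cost), and the numbers are made up for illustration, not measurements of py4j or Spark. The function name `overhead_fraction` is hypothetical.

```python
def overhead_fraction(fixed_overhead_s: float, per_row_s: float, n_rows: int) -> float:
    """Fraction of total wall time spent on fixed overhead.

    Assumed toy model: total = fixed_overhead_s + per_row_s * n_rows.
    """
    useful_work_s = per_row_s * n_rows
    return fixed_overhead_s / (fixed_overhead_s + useful_work_s)


# Hypothetical numbers: 2 s of fixed overhead, 1 microsecond of work per row.
small_prototype = overhead_fraction(2.0, 1e-6, 1_000)
large_workload = overhead_fraction(2.0, 1e-6, 1_000_000_000)
```

Under these assumed numbers the small prototype spends almost all of its time on overhead, while the large job barely notices it. Real connector costs also scale with the number of cross-language calls, which this toy model ignores.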

When I evaluated it, I also found that Databricks had extremely limited support for runtime environments defined by arbitrary containers. You had to select cluster nodes according to their prescribed images and choices.

Say you need TensorFlow or CUDA compiled with a weird set of optimization flags, or you need other special provisions in the runtime environment. In fact, variations of the runtime environment may even be part of some reproducible experiments, so you need to execute across a variety of parameters that govern how the runtime is built.

Anything that can’t support this type of stuff out of the box is just not worth it. Anybody can hook a notebook environment up to analyze data from some data warehouse or distributed file system.

The hard part is always how to make that setup configurable and parameterizable across the needs of different projects, especially arbitrary runtime environments.
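As a minimal sketch of what "parameterizable runtime environments" could look like, the snippet below generates one Dockerfile per combination of base image and compiler flags. Everything here is an illustrative assumption: the helper names `render_dockerfile` and `runtime_grid`, the image names, and the choice to pass flags through a `CFLAGS` environment variable are hypothetical, not how any particular platform does it.

```python
from itertools import product


def render_dockerfile(base_image: str, build_flags: str, packages: list[str]) -> str:
    """Render a Dockerfile as text for one runtime variant (hypothetical layout)."""
    lines = [
        f"FROM {base_image}",
        # Assumption: downstream source builds pick their flags up from CFLAGS.
        f'ENV CFLAGS="{build_flags}"',
    ]
    if packages:
        lines.append("RUN pip install --no-cache-dir " + " ".join(packages))
    return "\n".join(lines) + "\n"


def runtime_grid(base_images: list[str], flag_sets: list[str], packages: list[str]):
    """Yield one runtime description per (base image, flag set) combination."""
    for base, flags in product(base_images, flag_sets):
        yield {
            "base_image": base,
            "build_flags": flags,
            "dockerfile": render_dockerfile(base, flags, packages),
        }


# Placeholder image names and flags, purely for illustration.
variants = list(
    runtime_grid(["img-a", "img-b"], ["-O2", "-O3 -march=native"], ["numpy"])
)
```

Each entry in `variants` could then be built and tagged separately, so the runtime itself becomes just another experiment parameter rather than something fixed by the platform.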