Hacker News
by mhamilton723 2668 days ago
I definitely would agree that you should pick the best tool for the job and not limit yourself to one ecosystem if it's too difficult.

One way to make Spark a bit easier to work with is to run it on Kubernetes or to use a tool like Databricks that provides it as a service. Kubernetes, in particular, gives you a lot of flexibility and composability when designing systems. One thing we created to fill the gap of integrating System X with Spark is HTTP on Spark, which makes it easy to integrate Spark with other tools in a microservice architecture. When you couple this with containers, you can do a lot very quickly.

For datatypes, I would look into the different Spark connectors. These days there is one for almost every database, streaming service, and cloud store under the sun.

This being said, Spark is a large piece of software that uses many different programming concepts, which can be daunting. Our goal is to listen to feedback like this so we can make the Spark ecosystem a bit easier to use for everyone.

1 comments

This is actually the part I most disagree with. The overhead of connectors to Spark, particularly anything that goes through py4j, is far too limiting except when the data workload is so large that it effectively amortizes the overhead. For small-scale prototypes, it’s a disaster, and separately there are concerns about data type marshaling through the JVM when you may have a Python-only data model.
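
To make the amortization point concrete, here is a rough back-of-the-envelope sketch. It assumes the overhead behaves like a fixed cost per job, which is a simplification (per-row serialization costs would not amortize this way), and the numbers are made up for illustration, not measurements of Spark or py4j.

```python
def overhead_fraction(n_rows, fixed_overhead_s, per_row_work_s):
    """Share of total runtime spent on a fixed overhead.

    All three inputs are hypothetical: a fixed startup/bridging cost in
    seconds, and a constant amount of useful work per row in seconds.
    """
    total = fixed_overhead_s + n_rows * per_row_work_s
    return fixed_overhead_s / total


if __name__ == "__main__":
    # Illustrative values only: 5 s of fixed overhead, 1 microsecond of work per row.
    for n_rows in (10**3, 10**6, 10**9):
        share = overhead_fraction(n_rows, fixed_overhead_s=5.0, per_row_work_s=1e-6)
        print(f"{n_rows:>13,d} rows: overhead is {share:.1%} of runtime")
```

Under those assumed numbers the overhead dominates a prototype-sized job and only becomes negligible at very large row counts, which is the regime I mean by "amortizes".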

When I evaluated it, I also found Databricks had extremely limited support for runtime environments defined by arbitrary containers. You have to select cluster nodes according to their prescribed images and choices.

Say you need TensorFlow or CUDA compiled with a weird set of optimization flags, or you need other special provisions in the runtime environment. In fact, variations of the runtime environment may even be part of some reproducible experiments, so you need to execute across a variety of parameters that govern how the runtime is built.

Anything that can’t support this type of stuff out of the box is just not worth it. Anybody can hook a notebook environment up to analyze data from some data warehouse or distributed file system.

The hard part is always how to make that setup configurable and parameterizable across the needs of different projects, especially arbitrary runtime environments.
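
As a minimal sketch of what I mean by parameterizing the runtime, assuming the environment is built from a Dockerfile that declares matching `ARG`s: the parameter names, values, and the `experiment-runtime` image name below are all hypothetical, and the script only prints `docker build` commands rather than running them.

```python
import hashlib
import itertools
import json
import shlex

# Hypothetical parameter grid; names and values are illustrative only and
# must match ARG declarations in whatever Dockerfile is actually used.
GRID = {
    "CUDA_VERSION": ["10.0", "10.1"],
    "TF_VERSION": ["1.13", "1.14"],
    "OPT_FLAGS": ["-O2", "-O3 -march=native"],
}


def runtime_variants(grid):
    """Yield one dict per combination of build parameters."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))


def build_command(variant, image="experiment-runtime"):
    """Return a `docker build` argv for one variant.

    The tag is a short hash of the parameters, so the same variant
    always maps to the same image tag.
    """
    payload = json.dumps(variant, sort_keys=True).encode()
    digest = hashlib.sha256(payload).hexdigest()[:12]
    args = []
    for key, value in sorted(variant.items()):
        args += ["--build-arg", f"{key}={value}"]
    return ["docker", "build", *args, "-t", f"{image}:{digest}", "."]


if __name__ == "__main__":
    for variant in runtime_variants(GRID):
        print(shlex.join(build_command(variant)))
```

The point of hashing the parameters into the tag is that an experiment can record exactly which runtime it ran in. A platform that only offers a fixed menu of images has no place to plug something like this in.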