Hacker News
by mlthoughts2018 2668 days ago
On the point about autograd tools: unfortunately, they are not always helpful for models that don't fit that kind of framework, like gradient-free methods, custom Bayesian inference models, or modified versions of some traditional models (like bias-corrected logistic regression).
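
To make the gradient-free case concrete, here is a minimal sketch (a toy illustration, not taken from any particular library). The 0-1 loss of a threshold classifier is piecewise constant, so its gradient is zero almost everywhere and autograd has nothing useful to report. The data, the `zero_one_loss` objective, and the `pattern_search` helper are all made up for the example.

```python
import random

# Toy 1-D data: the label is 1 when x exceeds a cutoff (0.3, chosen arbitrarily).
random.seed(0)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [1 if x > 0.3 else 0 for x in xs]

def zero_one_loss(threshold):
    """Count misclassified points; piecewise constant in `threshold`."""
    return sum((1 if x > threshold else 0) != y for x, y in zip(xs, ys))

def pattern_search(loss, start, step=0.5, min_step=1e-4):
    """Gradient-free 1-D search: try a step each way, halve the step when neither helps."""
    best, best_loss = start, loss(start)
    while step > min_step:
        improved = False
        for candidate in (best - step, best + step):
            candidate_loss = loss(candidate)
            if candidate_loss < best_loss:
                best, best_loss = candidate, candidate_loss
                improved = True
        if not improved:
            step /= 2.0
    return best, best_loss

threshold, loss_value = pattern_search(zero_one_loss, start=-0.9)
print(threshold, loss_value)
```

The search only ever evaluates the objective, so it works on a loss that an autograd framework could not differentiate in any useful way. It can still stall on a plateau, which is the usual trade-off with this family of methods.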

Here again, if you tie machine learning to a big system like Spark, which is typically a huge IT cost in a lot of companies, and the company commits to an underlying data model suitable for Spark, it necessitates orienting everything around Spark (someone with the scale of Microsoft might not suffer this problem like everybody else). Altogether, this usually makes Spark such a limiting choice that it's totally impractical to standardize on it and give up all the other types of models, data storage, and cluster computing techniques you might need on a project-by-project basis.

I definitely agree that you should pick the best tool for the job and not limit yourself to one ecosystem if that makes things too difficult.

One way to make Spark a bit easier to work with is through Kubernetes or a tool like Databricks that provides it as a service. Kubernetes, in particular, gives you a really nice amount of flexibility and composability when designing systems. One thing that we created to try to fill the gap of having to integrate System X with Spark was HTTP on Spark. This makes it easy to integrate Spark with other tools in a microservice architecture. When you couple this with containers, you can do a lot very quickly.

For data types, I would look into the different Spark connectors. These days there is one for almost every database, streaming service, and cloud store under the sun.

This being said, Spark is a large piece of software that uses many different programming concepts, which can be daunting. Our goal is to listen to feedback like this so we can make the Spark ecosystem a bit easier to use for everyone.

This is actually the part I most disagree with. The overhead of connectors to Spark, particularly any use of py4j, is far too limiting except when the data workload is so large that it effectively amortizes the overhead. For small-scale prototypes, it’s a disaster, and separately there are concerns about data type marshaling through the JVM when you may have a Python-only data model.
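
The amortization argument is just arithmetic, and a back-of-the-envelope sketch shows the shape of it. The numbers below are invented for illustration (a 5 second fixed cost and 1 microsecond of useful work per row); they are not measurements of py4j or any Spark connector, and `overhead_fraction` is a hypothetical helper.

```python
def overhead_fraction(n_rows, fixed_overhead_s, per_row_s):
    """Share of total runtime spent on fixed overhead rather than useful work."""
    total = fixed_overhead_s + n_rows * per_row_s
    return fixed_overhead_s / total

# Illustrative numbers only, not benchmarks of any real system.
for n_rows in (1_000, 1_000_000, 1_000_000_000):
    share = overhead_fraction(n_rows, fixed_overhead_s=5.0, per_row_s=1e-6)
    print(f"{n_rows:>13,d} rows: {share:.1%} of runtime is overhead")
```

Under these assumed numbers, the fixed cost dominates at prototype scale and only fades once the workload reaches hundreds of millions of rows, which is the regime where the overhead is effectively amortized.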

When I evaluated it, I also found that Databricks had extremely limited support for runtime environments defined by arbitrary containers. You have to select cluster nodes according to their prescribed images and choices.

Say you need TensorFlow or CUDA compiled with a weird set of optimization flags, or you need other special provisions in the runtime environment. In fact, variations of the runtime environment may even be part of some reproducible experiments, so you need to execute across a variety of parameters that govern how the runtime is built.

Anything that can’t support this type of stuff out of the box is just not worth it. Anybody can hook a notebook environment up to analyze data from some data warehouse or distributed file system.

The hard part is always how to make that setup configurable and parameterizable across the needs of different projects, especially arbitrary runtime environments.
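
As a rough sketch of what "parameterizable runtime environments" could look like in practice: treat the build parameters as a grid and expand it into one container build per combination. This assumes a Dockerfile that declares matching `ARG`s; the parameter names (`CUDA_VERSION`, `TF_COPT_FLAGS`), their values, and the image name are hypothetical, and the code only assembles `docker build` command lines without running them.

```python
import itertools

# Hypothetical build parameters; names and values are illustrative only.
RUNTIME_GRID = {
    "CUDA_VERSION": ["10.0", "10.1"],
    "TF_COPT_FLAGS": ["-march=native", "-mavx2 -mfma"],
}

def expand_grid(grid):
    """Yield one dict per combination of parameter values."""
    keys = sorted(grid)
    for values in itertools.product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))

def docker_build_command(params, tag, context="."):
    """Assemble a `docker build` argument list with one --build-arg per parameter."""
    cmd = ["docker", "build", "--tag", tag]
    for key, value in sorted(params.items()):
        cmd += ["--build-arg", f"{key}={value}"]
    cmd.append(context)
    return cmd

commands = [
    docker_build_command(params, tag=f"experiment-runtime:env{i}")
    for i, params in enumerate(expand_grid(RUNTIME_GRID))
]

for cmd in commands:
    print(" ".join(cmd))
```

Each command could be handed to `subprocess.run` or a CI job, and the resulting tag recorded alongside the experiment so the runtime itself becomes one of the reproducible parameters.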