| Having evaluated Spark extensively for my company’s ML use cases, I came away deeply disappointed. One thing that bugs me in particular is that it essentially presumes all workflows involve huge, distributed datasets. But most model development work, especially for projects that will eventually be trained for production using huge, distributed data sets, must begin their life cycle as small data prototypes with extremely low-overhead development cycles. I’ve never found a way to organize Spark so that it can handle both tiny WIP prototyping and also production ML workloads. Inevitably you end up with two separated tool stacks and lots of kludgy processes to migrate from a successful prototype over to a Spark implementation, because the repeated overhead of Spark in the small prototype phase is far too costly and limiting. Other gripes include “bring your own container” solutions for controling the surrounding runtime details of different models on a project-by-project basis, and for Python at least, how to entirely bypass py4j and eliminate all JVM overhead, especially when working in cases that heavily rely on custom Python extension modules and extension types. To boot, for use cases when you have to write your own likelihood functions for models that don’t already exist in Spark libraries, and potentially you need to get down to the level of tuning the actual optimizer you’ll use to train, Spark is extremely opaque and full of incomprehensible errors and debugging issues. Not to mention that ideally you’d like to write those routines once and be able to use them across any development environments (whether executing via Spark or not), it leads to incentives towards “vendor lock in” effects where you don’t even try things at all unless they are out of the box in Spark. All this is to say, I wish that initiatives moving towards the very generous goal of open-sourcing ML tools to the community would view the idea of portability across different runtime environment / cluster computing / backend storage models to be a first-class requirement. |
For the likelihood functions comment, I would totally agree. Autograd libraries are easier to build custom likelihood models in, which is why we created CNTK on Spark, and databricks created Tensorflow on Spark. These give you the flexibility of modern deep learning stacks with the elasticity of spark
But in the end Spark is a single tool in a collection of tools and might not be right for your project, but it's been good for a lot of our work here at MSFT :)!