by mhamilton723
2668 days ago
MMLSpark is Microsoft’s open source initiative for advancing distributed computing in Apache Spark. MMLSpark provides deep learning, intelligent microservices, model deployment, model interpretability, image processing, and many other tools to help you build scalable, production-quality machine learning applications. MMLSpark is usable from Python, Scala, Java, and R and can be used on any Spark cluster. For more information and setup instructions, see our GitHub page: https://github.com/Azure/mmlspark Thank you!
One thing that bugs me in particular is that Spark essentially presumes all workflows involve huge, distributed datasets. But most model development work, especially for projects that will eventually be trained for production on huge, distributed datasets, must begin its life cycle as a small-data prototype with an extremely low-overhead development cycle. I’ve never found a way to organize Spark so that it can handle both tiny WIP prototyping and production ML workloads. Inevitably you end up with two separate tool stacks and lots of kludgy processes for migrating a successful prototype over to a Spark implementation, because the repeated overhead of Spark in the small prototype phase is far too costly and limiting.
Other gripes concern “bring your own container” solutions for controlling the surrounding runtime details of different models on a project-by-project basis and, for Python at least, how to bypass py4j entirely and eliminate all JVM overhead, especially in cases that rely heavily on custom Python extension modules and extension types.
To boot, when you have to write your own likelihood functions for models that don’t already exist in Spark libraries, and potentially need to get down to the level of tuning the actual optimizer you’ll use to train, Spark is extremely opaque and full of incomprehensible errors and debugging issues. Ideally you’d like to write those routines once and be able to use them in any development environment (whether executing via Spark or not). Instead, the friction creates incentives towards “vendor lock-in”, where you don’t even try things at all unless they work out of the box in Spark.
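As a rough illustration of the “write once” idea, here is a minimal sketch in plain Python, not anything taken from Spark or MMLSpark. The names `poisson_nll`, `total_nll`, `golden_section_minimize`, and the `map_fn` parameter are all hypothetical, and the Poisson model is just a stand-in for a custom likelihood. The only backend-specific piece is `map_fn`: locally it is the built-in `map`, and on a cluster it would presumably be whatever distributed map the backend offers (that wiring is an assumption and is not shown).

```python
import math
from typing import Callable, Iterable, Sequence


def poisson_nll(rate: float, counts: Iterable[int]) -> float:
    """Negative log-likelihood of i.i.d. Poisson counts (no Spark dependency)."""
    return sum(rate - k * math.log(rate) + math.lgamma(k + 1) for k in counts)


def total_nll(rate: float, partitions: Sequence[Sequence[int]], map_fn: Callable = map) -> float:
    """Sum per-partition NLLs; `map_fn` is the only backend-specific hook (hypothetical)."""
    return sum(map_fn(lambda part: poisson_nll(rate, part), partitions))


def golden_section_minimize(f: Callable[[float], float], lo: float, hi: float, tol: float = 1e-6) -> float:
    """Simple golden-section search for the minimum of a unimodal function on [lo, hi]."""
    inv_phi = (math.sqrt(5) - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2


# Tiny local prototype: data split into "partitions" to mimic a distributed layout.
partitions = [[2, 3, 1], [4, 0], [5, 3, 2]]
rate_hat = golden_section_minimize(lambda r: total_nll(r, partitions), 0.01, 20.0)
print(rate_hat)  # should be close to 2.5, the sample mean (the Poisson MLE)
```

The point of the design is that the likelihood and the optimizer never import anything cluster-specific, so the same routines can be debugged on a laptop and only the map step changes when the data gets large.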
All this is to say, I wish that initiatives pursuing the very generous goal of open-sourcing ML tools for the community would treat portability across different runtime environments, cluster computing frameworks, and backend storage models as a first-class requirement.