by mlthoughts2018
2668 days ago
On the point about autograd tools, they’re unfortunately not always helpful for models that don’t fit that kind of framework, like gradient-free methods, custom Bayesian inference models, or modified versions of some traditional models (like bias-corrected logistic regression). Here again, if you tie machine learning to a big system like Spark, which is typically a huge IT cost at a lot of companies, and the company commits to an underlying data model suited to Spark, you end up having to orient everything around Spark (someone at the scale of Microsoft might not suffer from this the way everybody else does). Altogether, that usually makes Spark such a limiting choice that it’s totally impractical to standardize on it and give up all the other types of models, data storage, and cluster computing techniques you might need on a project-by-project basis.

One way to make Spark a bit easier to work with is through Kubernetes or a tool like Databricks that provides it as a service. Kubernetes in particular gives you a lot of flexibility and composability when designing systems. One thing we created to fill the gap of integrating System X with Spark was HTTP on Spark, which makes it easy to integrate Spark with other tools in a microservice architecture. When you couple this with containers, you can do a lot very quickly.
For data types, I would look into the different Spark connectors; these days there is one for almost every database, streaming service, and cloud store under the sun.
That said, Spark is a large piece of software that uses many different programming concepts, which can be daunting. Our goal is to listen to feedback like this so we can make the Spark ecosystem a bit easier to use for everyone.
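
To make the microservice idea above a bit more concrete, here is a rough sketch of the general pattern: each row in a partition is sent to an external service over HTTP and the response is merged back into the row. This is not the HTTP on Spark API itself. It uses only the Python standard library, and `ScoreHandler`, `start_service`, `score_partition`, and `run_demo` are hypothetical names made up for illustration, with a toy local server standing in for "System X".

```python
import json
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.request import ProxyHandler, Request, build_opener


class ScoreHandler(BaseHTTPRequestHandler):
    """Hypothetical stand-in for 'System X': scores text by its length."""

    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length))
        body = json.dumps({"score": len(payload["text"])}).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # silence per-request logging for the example


def start_service():
    """Start the toy service on a free local port in a background thread."""
    server = HTTPServer(("127.0.0.1", 0), ScoreHandler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server


def score_partition(rows, url):
    """POST each row to the service and yield the row merged with the reply."""
    opener = build_opener(ProxyHandler({}))  # ignore any proxy settings
    for row in rows:
        req = Request(
            url,
            data=json.dumps(row).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="POST",
        )
        with opener.open(req, timeout=5) as resp:
            yield {**row, **json.loads(resp.read())}


def run_demo():
    server = start_service()
    url = f"http://127.0.0.1:{server.server_address[1]}/"
    try:
        # A plain list stands in for one partition of a distributed dataset.
        rows = [{"text": "spark"}, {"text": "kubernetes"}]
        return list(score_partition(rows, url))
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print(run_demo())
```

In a real Spark job, a function shaped like `score_partition` could be handed to something like PySpark's `rdd.mapPartitions`, with the service running in its own container; retries, batching, and error handling are left out here to keep the sketch short.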