by barneso 3780 days ago
There are plenty of alternatives to Spark ML. Here is a survey of random forest (RF) implementations: https://github.com/szilard/benchm-ml/tree/master/z-other-too...

There is a whole other world of algorithms out there that are not based on stochastic gradient descent; IMO it is sensible for TensorFlow to stick to one class of algorithms and do it well.

(Disclaimer: I work on mldb, one of the tools on that list).

2 comments

mldb looks great, but I was referring to distributed model building, in a horizontal way, which Spark ML does and TensorFlow says it does. If they can implement distributed gradient boosted trees across nodes, maybe even with GPU support (although I'm not sure it's applicable), that could be huge.

Once the open source version of TensorFlow releases multi-node support, this would be one way to make it work. There are potential gains from using a GPU for RF training. As for distributing, in my experience it doesn't make much difference for small models, and for larger models the cost of distributing the dataset outweighs the benefit of having multiple nodes. But an implementation carefully designed for a given node topology could be made faster.

Comparable AUC to xgboost, and faster? That's... pretty interesting.

Does that include the time to load the data into MLDB?

None of the systems include the data load time, but for mldb and the other non-distributed systems, it's only a few seconds.

(edit: my grammar is good not)
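
For anyone who wants to run this kind of comparison themselves, here is a minimal sketch of reporting load time and training time separately alongside AUC. It is not how the linked benchmark is implemented: `load`, `train`, and `predict` are hypothetical callables you would wire up to whichever tool you are testing, and the AUC here is a plain pairwise computation that is only practical for small test sets.

```python
import time


def auc(labels, scores):
    """Pairwise AUC: chance that a random positive scores above a random negative.

    Ties count as half a win. O(n_pos * n_neg), so small inputs only.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def benchmark(load, train, predict):
    """Time loading and training separately so either can be reported alone.

    load, train and predict are hypothetical stand-ins for the tool under test:
      load()               -> data
      train(data)          -> model
      predict(model, data) -> (labels, scores)
    """
    data, load_s = timed(load)
    model, train_s = timed(train, data)
    labels, scores = predict(model, data)
    return {"load_s": load_s, "train_s": train_s, "auc": auc(labels, scores)}
```

Keeping `load_s` and `train_s` as separate fields makes it explicit which of the two a reported number covers.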