by ogrisel 4132 days ago
While fitting the algorithm itself might not often benefit from partitioned data, I see two upsides to using Spark for predictive modeling.

First, it makes it easy to do the feature extraction and model fitting in the same pipeline, which makes it possible to cross-validate the impact of the hyper-parameters of the feature extraction part. Feature extraction generally starts from a collection of large, raw datasets that need to be filtered, joined and aggregated (for instance a log of user clicks, sessionized by user id over temporal windows, then geo-joined to GIS data via a geoip resolution of the IP address of the user agent). While the raw datasets of clicks and geographical databases might be too big to be processed efficiently on a single node, the resulting extracted features (e.g. user session statistics enriched with geo features) are typically much smaller and could be processed on a single node to build a predictive model. However, Spark RDDs make it natural to trace the provenance of the data, hence trivial to rebuild downstream models when tweaking the upstream operations used to extract the features. The native caching features of Spark make that kind of workflow very efficient with minimal boilerplate (e.g. no manual file versioning).
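To make the workflow concrete, here is a minimal single-machine sketch in plain Python, not Spark code. The toy click log, the `gap_seconds` hyper-parameter and the function names are all made up for illustration, and `functools.lru_cache` only stands in for the role that RDD caching would play on a cluster.

```python
from functools import lru_cache

# Hypothetical raw click log: (user_id, timestamp_in_seconds).
CLICKS = (
    ("alice", 0), ("alice", 20), ("alice", 500),
    ("bob", 10), ("bob", 700), ("bob", 710), ("bob", 2000),
)


@lru_cache(maxsize=None)
def session_counts(gap_seconds):
    """Upstream step: count sessions per user.

    A new session starts after a pause longer than gap_seconds.
    The result is memoized per value of the hyper-parameter.
    """
    by_user = {}
    for user, ts in sorted(CLICKS):
        by_user.setdefault(user, []).append(ts)
    counts = {}
    for user, stamps in by_user.items():
        sessions = 1
        for prev, cur in zip(stamps, stamps[1:]):
            if cur - prev > gap_seconds:
                sessions += 1
        counts[user] = sessions
    return tuple(sorted(counts.items()))


def mean_sessions(gap_seconds):
    """Downstream step: depends on the raw data only through session_counts."""
    counts = session_counts(gap_seconds)
    return sum(n for _, n in counts) / len(counts)


# Tweaking the upstream hyper-parameter rebuilds the downstream result;
# revisiting a value already seen (30) reuses the cached features.
results = {gap: mean_sessions(gap) for gap in (30, 600, 30)}
```

The point is only the shape of the dependency: the downstream function is defined in terms of the upstream one, so changing the extraction parameter recomputes exactly what is needed, with no intermediate files to name or version by hand.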

Second, while the underlying ML algorithm might not always benefit from parallelization in itself, there are meta-level modeling operations that are both CPU intensive and embarrassingly parallel, and that can therefore benefit greatly from a compute cluster running Spark. The canonical cases are cross-validation and hyper-parameter tuning.
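As a rough illustration of why this is embarrassingly parallel, here is a sketch using only the Python standard library. The toy data, the 1-D ridge model and the grid of `alpha` values are assumptions chosen for brevity, and the thread pool merely stands in for cluster workers (Python threads will not give a real CPU speedup here).

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product

# Hypothetical toy data: y is roughly 2 * x with a small alternating offset.
X = [float(i) for i in range(1, 13)]
Y = [2.0 * x + (0.5 if i % 2 else -0.5) for i, x in enumerate(X)]
N_FOLDS = 3
ALPHAS = [0.0, 1.0, 10.0, 100.0]


def fit_ridge(xs, ys, alpha):
    """1-D ridge regression without intercept: w = sum(x*y) / (sum(x*x) + alpha)."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + alpha)


def evaluate(task):
    """Fit on all folds but one and return the held-out mean squared error."""
    alpha, fold = task
    train = [i for i in range(len(X)) if i % N_FOLDS != fold]
    test = [i for i in range(len(X)) if i % N_FOLDS == fold]
    w = fit_ridge([X[i] for i in train], [Y[i] for i in train], alpha)
    mse = sum((Y[i] - w * X[i]) ** 2 for i in test) / len(test)
    return alpha, mse


# Every (alpha, fold) pair is an independent task: no shared state,
# no communication between tasks, so they can run in any order.
tasks = list(product(ALPHAS, range(N_FOLDS)))
with ThreadPoolExecutor(max_workers=4) as pool:
    scores = list(pool.map(evaluate, tasks))

# Average the per-fold errors for each alpha and keep the best one.
cv_error = {a: sum(m for alpha, m in scores if alpha == a) / N_FOLDS for a in ALPHAS}
best_alpha = min(cv_error, key=cv_error.get)
```

Each task only needs the (small) extracted feature set plus one grid point and one fold index, which is why distributing the tasks, rather than the data, is a good fit for this kind of workload.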