by BenoitP
3692 days ago
I think your question is oriented towards X being a business problem.

Netflix has users (say 100M) who have liked some of its movies (say 100k). The question is: for every user, find movies he/she would like but has not seen yet.

The dataset in question is large, and you have to answer this question with data regarding every user-movie pair (that would be 1e13 pairs). A problem of this size needs to be distributed across a cluster.

Spark lets you express computations across this cluster, letting you explore the problem. Spark also provides a fairly rich machine learning toolset [1], including ALS-WR [2], which was developed specifically for a competition organised by Netflix and got great results [3].

[1] http://spark.apache.org/docs/latest/mllib-guide.html
[2] http://spark.apache.org/docs/latest/mllib-collaborative-filt...
[3] http://www.grappa.univ-lille3.fr/~mary/cours/stats/centrale/...
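
As a rough sanity check on the scale above, here is a back-of-the-envelope sketch in plain Python. The user and movie counts are the ones from the comment; the 4 bytes per score and the rank of 50 are my own illustrative assumptions, not anything Netflix or Spark prescribes.

```python
# Back-of-the-envelope sizing for the user-movie problem described above.
# USERS and MOVIES come from the comment; bytes_per_score, rank and
# bytes_per_value are illustrative assumptions.

def pair_count(users: int, movies: int) -> int:
    """Number of user-movie pairs that would need a predicted score."""
    return users * movies

def dense_bytes(users: int, movies: int, bytes_per_score: int = 4) -> int:
    """Bytes needed to hold one score per pair in a dense matrix."""
    return pair_count(users, movies) * bytes_per_score

def factor_bytes(users: int, movies: int, rank: int = 50,
                 bytes_per_value: int = 4) -> int:
    """Bytes for two low-rank factor matrices (users x rank, movies x rank),
    which is the kind of model a matrix factorization method such as ALS fits."""
    return (users + movies) * rank * bytes_per_value

USERS = 100_000_000
MOVIES = 100_000

print("pairs:", pair_count(USERS, MOVIES))
print("dense matrix, TB:", dense_bytes(USERS, MOVIES) / 1e12)
print("rank-50 factors, GB:", factor_bytes(USERS, MOVIES) / 1e9)
```

Under these assumptions the dense score matrix is far too large for a single machine, while the factor matrices are much smaller, which is part of why a factorization approach is attractive here.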