by evancasey
4410 days ago
I did most of my benchmarking with the 10M MovieLens Dataset http://grouplens.org/datasets/movielens/, which consists of 10 million movie ratings on 10,000 movies from 72,000 users. So not necessarily "big data", but big enough to warrant a distributed approach. Spark is ideally suited for iterative, multi-stage jobs. In theory, anything that requires doing multiple operations on a working dataset (e.g. graph processing, recommender systems, gradient descent) will do well on Spark due to the in-memory data caching model. This post explains some of the applications Spark is well-suited for: http://www.quora.com/Apache-Spark/What-are-use-cases-for-spa...
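To make the caching point concrete, here is a rough pure-Python sketch (not Spark code, and not what I benchmarked). It fits y = w * x by gradient descent twice: once rescanning the source on every iteration, once reusing an in-memory copy. The names `load_ratings` and `gradient_descent` and the toy data are made up for illustration; a real job would be working against an RDD rather than a Python list.

```python
# Pure-Python stand-in for the caching idea; this is NOT Spark code.
# All names here (load_ratings, gradient_descent) are hypothetical.

load_count = 0  # how many times the "expensive" source was scanned


def load_ratings():
    """Pretend to read the dataset from slow storage."""
    global load_count
    load_count += 1
    # Toy (x, y) points lying exactly on y = 3x.
    return [(float(x), 3.0 * x) for x in range(1, 6)]


def gradient_descent(get_data, steps=50, lr=0.01):
    """Fit y = w * x by gradient descent, fetching the data on every step."""
    w = 0.0
    for _ in range(steps):
        data = get_data()
        # Gradient of the mean squared error with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


# Without caching: every iteration rescans the source.
w_uncached = gradient_descent(load_ratings)
uncached_loads = load_count

# With caching: load once, keep the working set in memory, reuse it.
load_count = 0
cached = load_ratings()
w_cached = gradient_descent(lambda: cached)
cached_loads = load_count

print(f"uncached: w={w_uncached:.4f}, loads={uncached_loads}")
print(f"cached:   w={w_cached:.4f}, loads={cached_loads}")
```

Both runs arrive at the same weight, but the second touches the source once instead of once per iteration. That gap is what grows with the number of passes an iterative job makes over its data.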
|
By comparison, I'm trying (and failing) to work with RDDs of 100+ billion elements.