by iskander 4409 days ago
I'm really curious to find out in what situations Spark actually works for people. So far, no one in my lab seems to be having a terribly productive time using it. Maybe it's better for simple numerical computations? How large are the datasets you're working with?

I did most of my benchmarking with the 10M MovieLens Dataset http://grouplens.org/datasets/movielens/ consisting of 10 million movie ratings on 10,000 movies from 72,000 users. So not necessarily "big data", but big enough to warrant a distributed approach.

Spark is ideally suited for iterative, multi-stage jobs. In theory, anything that requires doing multiple operations on a working dataset (e.g. graph processing, recommender systems, gradient descent) will do well on Spark due to the in-memory data caching model. This post explains some of the applications Spark is well-suited for: http://www.quora.com/Apache-Spark/What-are-use-cases-for-spa...
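
To make the caching argument concrete, here is a toy model in plain Python. It is not the Spark API: `run_iterations`, `load_ratings`, and the three sample ratings are hypothetical stand-ins. It only counts how often the working dataset gets re-read when an iterative job (a trivial gradient descent on the mean rating) keeps it in memory versus reloading it on every pass.

```python
# Toy model only: plain Python, not the Spark API.
# RATINGS, run_iterations and load_ratings are hypothetical names.

RATINGS = [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)]  # (user, movie, rating)


def run_iterations(n_iters, cache):
    """Fit the mean rating by gradient descent; return (mean, number_of_loads)."""
    loads = 0

    def load_ratings():
        # Stand-in for an expensive read of the dataset from disk.
        nonlocal loads
        loads += 1
        return list(RATINGS)

    # With caching, the dataset is read once and reused by every iteration.
    cached = load_ratings() if cache else None

    mean = 0.0
    for _ in range(n_iters):
        data = cached if cache else load_ratings()
        # Gradient of 0.5 * (mean - rating)^2, averaged over the data.
        grad = sum(mean - r for _, _, r in data) / len(data)
        mean -= 0.5 * grad
    return mean, loads


if __name__ == "__main__":
    print(run_iterations(20, cache=False))  # reloads on every iteration
    print(run_iterations(20, cache=True))   # loads once
```

Both runs reach the same estimate; the only difference is that the uncached run pays the load cost once per iteration, which is the overhead the in-memory model is meant to avoid.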

So the central piece of data is something like a 10 million element RDD of (UserId, (MovieId, Rating))? If so, it sounds like that data would fit into a single in-memory sparse array. How does Spark's performance compare with a local implementation?

By comparison, I'm trying (and failing) to work with RDDs of 100+ billion elements.
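
A rough back-of-the-envelope check of the two scales mentioned here, as a sketch only. It assumes a coordinate-format layout with 32-bit ids and 32-bit ratings, and it assumes (purely for illustration) that the 100 billion elements are similar triples. `coo_bytes` is a hypothetical helper, not a library function.

```python
def coo_bytes(n_entries, id_bytes=4, value_bytes=4):
    """Bytes for a coordinate-format sparse matrix: one (row, col, value) per entry.

    Assumes fixed-width ids and values and ignores any container overhead.
    """
    return n_entries * (2 * id_bytes + value_bytes)


if __name__ == "__main__":
    small = coo_bytes(10_000_000)         # the 10 million rating case
    large = coo_bytes(100_000_000_000)    # the 100 billion element case
    print(small / 1e6, "MB")   # 120.0 MB
    print(large / 1e12, "TB")  # 1.2 TB
```

Under those assumptions the 10 million rating dataset is on the order of 120 MB, which fits comfortably in one machine's memory, while 100 billion entries is on the order of a terabyte.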

What is the difference between Spark and Storm? They both seem like "realtime compute engines".

*edit - from what I can see, Spark is a replacement for Hadoop (offline jobs), whereas Storm deals with online stream processing.

Storm is generally more of a per-event dataflow system for real-time or near-real-time computation (with each event flowing through N spouts and bolts), whereas Spark is more of an in-memory data processing system (with Spark Streaming being the "equivalent" of the Storm system).
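
A minimal sketch of the two processing styles in plain Python, using neither Storm's nor Spark's API. `per_event` and `micro_batch` are hypothetical names. Spark Streaming groups incoming data into small batches by time interval; the sketch batches by count instead, purely to keep it short.

```python
def per_event(events, handler):
    """Storm-like style: the handler is invoked once for each event."""
    return [handler(event) for event in events]


def micro_batch(events, batch_size, batch_handler):
    """Spark Streaming-like style: events are grouped into small batches,
    and the handler processes one whole batch at a time.

    Batching by count is an assumption made for brevity; real micro-batching
    is driven by a time interval.
    """
    results = []
    for start in range(0, len(events), batch_size):
        results.append(batch_handler(events[start:start + batch_size]))
    return results


if __name__ == "__main__":
    events = [1, 2, 3, 4, 5]
    print(per_event(events, lambda x: x * 2))  # [2, 4, 6, 8, 10]
    print(micro_batch(events, 2, sum))         # [3, 7, 5]
```

The per-event style reacts to each item as it arrives, while the batched style trades a little latency for the ability to run ordinary batch operations over each group.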