| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by evancasey 4409 days ago
	My experience using spark has also been nothing but positive. I recently built a similarity-based recommendation on Spark (https://github.com/evancasey/sparkler), and found it to be significantly faster than comparable implementations on Hadoop. subprotocol's point about specifying the number of tasks/data partitions to use is true - you need to manually set this in order to get good results even on a small dataset. However, other than that, spark will give you good results pretty much out of the box. More advanced features such as broadcast objects, cache operations, and custom serializers will further optimize your application, but are not critical when first starting out as the author seems to believe.

1 comments

iskander 4409 days ago

I'm really curious to find out in what situations Spark actually works for people. So far, no one in my lab seems to be having a terribly productive time using it. Maybe it's better for simple numerical computations? How large are the datasets you're working with?

link

evancasey 4409 days ago

I did most of my benchmarking with the 10M MovieLens Dataset http://grouplens.org/datasets/movielens/ consisting of 10 million movie ratings on 10,000 movies from 72,000 users. So not necessarily "big data", but big enough to warrant a distributed approach.

Spark is ideally suited for iterative, multi-stage jobs. In theory, anything that requires doing multiple operations an a working dataset (i.e. graph processing, recommender systems, gradient descent) will do well on Spark due to the in-memory data caching model. This post explains some of the applications Spark is well-suited for: http://www.quora.com/Apache-Spark/What-are-use-cases-for-spa...

link

iskander 4409 days ago

So the central piece of data is something like a 10 million element RDD of (UserId, (MovieId, Rating))? If so, it sounds like that data would fit into a single in-memory sparse array, how does Spark's performance compare with a local implementation?

By comparison, I'm trying (and failing) to work with RDDs of 100+ billion elements.

link

platz 4409 days ago

What is the difference between Spark and Storm? They both seem like "realtime compute engines"

*edit - from what I can see Spark is a replacement for hadoop (offline jobs), where Storm deals with online stream processing

link

ironchef 4409 days ago

Storm is generally more of a dataflow "per event" real/near time computation system (with each event flowing through N spouts and bolts) whereas Spark is more of an in-memory data processing system (with Spark streaming being the "equivalent" to the storm system).

link