by agibsonccc 4409 days ago
Spark is an interesting technology, but from what I've heard it doesn't actually have traction in industry yet.

Anyone here actually using it in production? I know it's blazing fast etc., and I like it as a MapReduce replacement. It has all the makings of a great distributed system, but I'm still waiting to see a major deployment.

5 comments

May be of relevance: https://cwiki.apache.org/confluence/display/SPARK/Powered+By...

I don't know what you would count as a major deployment, but I've deployed a 30-node cluster on hardware for running sub-second, real-time ad hoc queries. I've also run many smaller 10-20 node virtual clusters on OpenStack. It is a rock solid platform. Our hosted ops loves it because it just works.

The amazing thing about Spark is how insanely expressive and hackable it is. The best way I can describe it is this:

* Hadoop: You spend all of your time telling it how to do what you want (it is the assembly language of big data)

* Spark: You spend your time telling it what you want, and it just does it
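
To make the "how" versus "what" contrast above concrete, here is a rough plain-Python sketch of a word count written both ways. It uses neither the Hadoop nor the Spark API; the function names and the sample `docs` list are made up for illustration, and the first function only imitates the map/shuffle/reduce phases you would spell out by hand.

```python
from collections import Counter, defaultdict

# Hypothetical sample input, standing in for lines of a large dataset.
docs = ["spark is fast", "hadoop is batch", "spark is expressive"]


def word_count_explicit(lines):
    """'How' style: write out the map, shuffle, and reduce phases by hand."""
    # Map phase: emit a (word, 1) pair for every word.
    pairs = []
    for line in lines:
        for word in line.split():
            pairs.append((word, 1))
    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    # Reduce phase: sum the values in each group.
    return {word: sum(ones) for word, ones in groups.items()}


def word_count_declarative(lines):
    """'What' style: state the result and let the library do the grouping."""
    return dict(Counter(word for line in lines for word in line.split()))


# Both styles produce the same counts; only the amount of plumbing differs.
print(word_count_explicit(docs) == word_count_declarative(docs))
```

The second version is closer in spirit to what the comment describes for Spark: you name the result you want and leave the grouping to the framework.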

> It is a rock solid platform. Our hosted ops loves it because it just works.

I'm really confused by how different our experiences have been. Above you said you wrote your own shuffle implementation; presumably that was prompted by poor performance at some point. And when you encountered that poor performance, you presumably also saw what happens to Spark when it's overwhelmed: a sea of exceptions. In a short period of time I've encountered lots of the following:

- FileNotFound exceptions when shuffle files couldn't be created

- Too many open file handles (also related to shuffle files)

- Infinite procession of out-of-memory errors on a cluster with 12TB of memory.

- Executor disconnected

- Weird Akka errors

- Mysterious serialization errors (Map.values isn't serializable, making a nested partitioner class for some reason didn't work)

These errors are sometimes recoverable and other times kill all the workers on the cluster. Did none of these things happen to your team?
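
For the shuffle-file errors in the list above, a back-of-the-envelope sketch may help explain why file handles run out. It assumes the hash-based shuffle of early Spark releases, where each map task writes one file per reduce partition. All numbers below (task counts, cores per node, the open-file limit) are hypothetical and not taken from the thread.

```python
def shuffle_files_total(map_tasks: int, reduce_tasks: int) -> int:
    """Files written by a hash-based shuffle: one per (map task, reduce partition)."""
    return map_tasks * reduce_tasks


def open_handles_per_node(concurrent_map_tasks: int, reduce_tasks: int) -> int:
    """Files a node holds open at once if every running map task
    keeps one output file open per reduce partition."""
    return concurrent_map_tasks * reduce_tasks


# Hypothetical job and cluster settings, for illustration only.
MAP_TASKS = 1000
REDUCE_TASKS = 1000
CORES_PER_NODE = 16      # assumed: one concurrent map task per core
OPEN_FILE_LIMIT = 1024   # assumed per-process limit; check `ulimit -n` on your nodes

total = shuffle_files_total(MAP_TASKS, REDUCE_TASKS)
per_node = open_handles_per_node(CORES_PER_NODE, REDUCE_TASKS)

print(f"shuffle files written in total: {total}")
print(f"files open at once per node:    {per_node}")
print(f"exceeds open-file limit:        {per_node > OPEN_FILE_LIMIT}")
```

With these made-up numbers the job writes 1,000,000 shuffle files and each node holds 16,000 open at once, well past the assumed limit. Fewer reduce partitions or a higher limit are the obvious knobs under this model.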

This does help, actually. And yes: it doesn't have to be a 1000-node cluster or anything crazy. I've just talked to a lot of people at bigger companies, and they've all said it still falls over.

Great to hear success stories!

Yahoo was initially playing around with Spark. They opted for Tez on YARN instead: http://yahoodevelopers.tumblr.com/post/85930551108/yahoo-bet...
Ooyala has a huge deployment that they use alongside their Cassandra cluster (something like ~100 nodes, and ~50TB of data IIRC)
I'd love to learn more about how they're using Spark. Are there any blog posts or tech talks floating around?
Here's a talk at Hakka Labs given by an Ooyala engineer (@evanfchan), which is how I knew they used Spark: https://www.youtube.com/watch?v=PjZp7K5z7ew - and the accompanying slides: http://www.slideshare.net/planetcassandra/south-bay-cassandr...

They use Spark on top of Cassandra, and they also use Shark, Spark's version of Hive.

Thanks for posting this. I'm starting to get a feel for when Spark is usable: you need an underlying indexed data store which lets you fetch small subsets of your data into RDDs (or your data can be tiny to begin with). We've been trying to use Spark on input sizes which, while smaller than our cluster's available memory, are probably too big for Spark to handle (> 1TB).
These guys (http://blog.tuplejump.com/) look to be doing some nice work integrating Cassandra and Spark. They've piggybacked on the Cassandra clustering, using a Java agent to run the Spark masters. There doesn't seem to be a release available yet, though.
eBay posted this two days ago...

Using Spark to Ignite Data Analytics (http://www.ebaytechblog.com/2014/05/28/using-spark-to-ignite...)