by subprotocol 4398 days ago
May be of relevance: https://cwiki.apache.org/confluence/display/SPARK/Powered+By...

I don't know what you would count as a major deployment, but I've deployed a 30-node cluster on physical hardware for running sub-second, real-time ad hoc queries. I've also run many smaller 10-20 node virtual clusters on OpenStack. It is a rock-solid platform. Our hosted ops team loves it because it just works.

The amazing thing about Spark is how insanely expressive and hackable it is. The best way I can describe it is this:

* Hadoop: You spend all of your time telling it how to do what you want (it is the assembly language of big data)

* Spark: You spend your time telling it what you want, and it just does it
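
To make the "how" versus "what" contrast concrete, here is a rough analogy in plain Python. It uses neither the Hadoop nor the Spark API; the function names and the sample input are made up for illustration. The first function spells out map, shuffle, and reduce phases by hand, and the second just states the result it wants.

```python
from collections import Counter, defaultdict

# Hypothetical sample input, not taken from any real job.
lines = ["spark is fast", "hadoop is batch", "spark is expressive"]


def word_count_by_phases(lines):
    """'How' style: every phase of the computation is written out."""
    # Map phase: emit a (word, 1) pair for every word.
    pairs = [(word, 1) for line in lines for word in line.split()]
    # Shuffle phase: group the emitted values by key.
    groups = defaultdict(list)
    for word, one in pairs:
        groups[word].append(one)
    # Reduce phase: sum each group.
    return {word: sum(ones) for word, ones in groups.items()}


def word_count_declarative(lines):
    """'What' style: ask for the counts and let the library work out the steps."""
    return dict(Counter(word for line in lines for word in line.split()))
```

Both functions return the same counts. The difference is only in how much of the plumbing the caller has to write, which is the point of the comparison above.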

2 comments

> It is a rock solid platform. Our hosted ops loves it because it just works.

I'm really confused by how different our experiences have been. Above you said you wrote your own shuffle implementation; presumably that was prompted by poor performance at some point. And when you encountered that poor performance, you presumably also saw what happens to Spark when it's overwhelmed: a sea of exceptions. In a short period of time I've encountered lots of the following:

- FileNotFound exceptions when shuffle files couldn't be created

- Too many open file handles (also related to shuffle files)

- An endless procession of out-of-memory errors on a cluster with 12 TB of memory

- Executor disconnected

- Weird Akka errors

- Mysterious serialization errors (Map.values isn't serializable, making a nested partitioner class for some reason didn't work)

These errors are sometimes recoverable and other times kill all the workers on the cluster. Did none of these things happen to your team?
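
For what it's worth, the two shuffle-file errors are easy to reason about with some back-of-the-envelope arithmetic. The sketch below assumes a hash-based shuffle in which every map task writes one file per reduce partition, which is my understanding of how Spark's shuffle behaved at the time; the function names and the example numbers are hypothetical.

```python
def hash_shuffle_files(map_tasks, reduce_partitions):
    """Total shuffle files, assuming one file per (map task, reduce partition) pair."""
    return map_tasks * reduce_partitions


def open_files_per_executor(concurrent_map_tasks, reduce_partitions):
    """Files held open at once on one executor, assuming each running
    map task keeps one writer open per reduce partition."""
    return concurrent_map_tasks * reduce_partitions


def exceeds_limit(concurrent_map_tasks, reduce_partitions, open_file_limit):
    """True if the estimate is above the per-process open-file limit.

    On Unix the real soft limit can be read with
    resource.getrlimit(resource.RLIMIT_NOFILE); here it is passed in.
    """
    return open_files_per_executor(concurrent_map_tasks, reduce_partitions) > open_file_limit
```

Under those assumptions, 8 concurrent map tasks and 2,000 reduce partitions means 16,000 files open at once on a single executor, far above a limit of 1,024, so "too many open files" is the expected outcome rather than a surprise.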

This does help, actually. And yes: it doesn't have to be a 1000-node cluster or anything crazy. I've just talked to a lot of people at bigger companies, and they've all said it falls over.

Great to hear success stories!