> It is a rock solid platform. Our hosted ops loves it because it just works.

I'm really confused by how different our experiences have been. Above, you said you wrote your own shuffle implementation; presumably that was prompted by poor performance at some point. And when you hit that poor performance, you presumably also saw what happens to Spark when it's overwhelmed: a sea of exceptions. In a short period of time I've encountered a lot of the following:

- FileNotFound exceptions when shuffle files couldn't be created

- Too many open file handles (also related to shuffle files)

- Infinite procession of out-of-memory errors on a cluster with 12TB of memory

- Executor disconnected

- Weird Akka errors

- Mysterious serialization errors (Map.values isn't serializable; making a nested partitioner class didn't work, for some reason)

These errors are sometimes recoverable; other times they kill all the workers on the cluster. Did none of these things happen to your team?