Hacker News new | ask | show | jobs
by lightcatcher 3578 days ago
I tried (and failed after a month of work) earlier this year trying to do ~20TB shuffles with Spark. I felt both relief and frustration reading this post.

Relief: I'm not an idiot, and the problems in the shuffle were likely in Spark and not just me being a beginner user.

Frustration: I wanted to group about 100 billion x 200 byte records into 5 billion groups. This seems like exactly the problem Spark is designed for and is advertised for. I had great difficulty even getting example Spark SQL code (or my own RDD based code) working. I hear so many great things about Spark as a tool for big data, and also "your data isn't big". I considered 20TB on the "low end" of big data, but a seemingly popular and widely used big data tool can't shuffle it without numerous bug fixes and pain on the part of the user. Shuffling 90TB was worth a Facebook blog post! This makes me ask: To all the people using Spark for "big data", how painful is it and how much data are you handling? It appears the answer for >=20TB is "very painful", and for <=5TB I think you're generally in "handle on single node" territory.

1 comments

One of the hardest parts for me was getting the cluster sized appropriately so that all data stayed in memory. Overflow to disk slows things down a lot. But, sizing the cluster can be tricky if you're generating a lot of data structures in the tasks.

I only use RDDs, so I put a lot of thought into the processing flow so that unnecessary data wasn't shuffled. If you're reducing PairRDDs, make sure that the data is evenly distributed. Also, I'm guessing you read the optimization docs, but a huge amount of network I/O can be reduced by choosing the right types and collections and optimizing serialization. And, of course group within partitions first, then within nodes, then across nodes. And, of course, go for fewer bigger servers with lots of network bandwidth.

There are a lot of tricks, unfortunately. And, since I don't know your experience level I won't bore you with things you probably already know.

I have no practical experience with Spark, but I was wondering where Alluxio fits into what you're describing.

Is it only applicable once the "cluster [is] sized appropriately so that all data stayed in memory" as you mention?

Alluxio can cache objects in memory that are used downstream in other steps or even other jobs.

In Spark, the biggest performance gains are realized when all data stays in memory. But, the Spark memory architecture is kind of opaque. It's pretty well documented, and there are some good blogs out there, but there are a lot of factors in play.

For example, let's say we start with 1 TB of compressed data and take a guess that it will uncompress to 10 TB. So, we do a naive calculation and guess that we'll need 20 nodes, each with 500 GB of memory.

For our example, Spark will be managed by Yarn. Yarn is going to take a portion of each node's memory. I forget how much, but let's say 10%. Then, depending on how many executors we have per node, each will also have memory overhead. Again, I don't know exactly, but let's just say another 10%. Then, some memory will be dedicated to system processes, etc. Then there's the Spark driver. So, after all of that, let's say 25-30% of memory is used outside of the executor. So, that leaves us with 10 executors per node, each with ~35 GB of memory. For maximum performance, we have to ensure that all 10 TB fits across all executors. So, we may actually need 29 nodes, or a lot of data is going to spill to disk.

Then, let's say that our processing job doesn't evenly distribute data across partitions. Some partitions get 5x as much data as other partitions. Again, data spills to disk - or the VM runs out of heap. Or, maybe our 10% compression estimate is off by 5% - more spills to disk.

I think Alluxio helps with the initial data read in the first stage, but subsequent stages' datasets are stored in memory anyway.