|
|
|
|
|
by ap22213
3578 days ago
|
|
"While running on 20 TB of input, we discovered that we were generating too many output files (each sized around 100 MB) due to the large number of tasks." I could be completely missing something here, but to decrease the number of output files you can coalesce() the RDD before you write. For example, let's say you have a 20 node cluster, each with 10 executors, and the RDD action is split into 20000 tasks. You may end up with 20000 partitions (or more). However, you can coalesce, and reduce the number down to 200 partitions. Or, if necessary, you could even shuffle (across VMs but within the node) down to 20 partitions, if you're really motivated. What am I missing? |
|
"As far as we know, this is the largest real-world Spark job attempted in terms of shuffle data size"
I'm far, far from a world class engineer, but I regularly do 90 TiB shuffle sorts. I must seriously be missing something, here.