
by ap22213 3578 days ago

"While running on 20 TB of input, we discovered that we were generating too many output files (each sized around 100 MB) due to the large number of tasks."

I could be completely missing something here, but to decrease the number of output files you can coalesce() the RDD before you write. For example, say you have a 20-node cluster with 10 executors per node, and the RDD action is split into 20,000 tasks. You may end up with 20,000 partitions (or more), and each partition is written as its own output file. However, you can coalesce down to 200 partitions. Or, if necessary, you could even shuffle (across VMs but within the node) down to 20 partitions, if you're really motivated.

What am I missing?
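To put rough numbers on that suggestion, here is a back-of-envelope sketch in plain Python (no Spark needed). It assumes one output file per partition and reuses the hypothetical 20,000 tasks above with the roughly 100 MB file size from the quote; the function name and the decimal units are illustrative assumptions, not anything taken from the article.

```python
# Back-of-envelope sketch: fewer partitions means fewer but larger output files.
# Assumption: every partition is written as exactly one output file.
MB = 10**6
GB = 10**9


def avg_file_bytes(total_output_bytes, num_partitions):
    """Average size of each output file for a given partition count."""
    return total_output_bytes / num_partitions


# Hypothetical job from the comment: 20,000 tasks, each writing ~100 MB.
total_output = 20000 * 100 * MB

for partitions in (20000, 200, 20):
    size_gb = avg_file_bytes(total_output, partitions) / GB
    print(f"{partitions:>6} partitions -> {partitions:>6} files, ~{size_gb:g} GB each")
```

Under these assumptions the file count falls from 20,000 to 200 or 20, while each file grows from about 100 MB to about 10 GB or 100 GB, so the trade is file count against file size.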

3 comments

ap22213 3578 days ago

"Remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort."

"As far as we know, this is the largest real-world Spark job attempted in terms of shuffle data size"

I'm far, far from a world-class engineer, but I regularly do 90 TiB shuffle sorts. I must seriously be missing something here.


tveita 3578 days ago

Have you run into any of the issues mentioned in the article? Some of them are regressions. Which version of Spark were you running?

Out of the linked issues, these all seem like they would be "easy" to hit given enough data:

https://issues.apache.org/jira/browse/SPARK-13279

https://issues.apache.org/jira/browse/SPARK-13850

https://issues.apache.org/jira/browse/SPARK-13958

https://issues.apache.org/jira/browse/SPARK-14363


sigzero 3578 days ago

Are you using Spark? That's the context.


ap22213 3578 days ago

Yes


liancheng 3577 days ago

Coalescing down to a smaller number of partitions does decrease the number of output files. But it also decreases parallelism, which is not what you want when processing such a large dataset.

Coalescing makes more sense when some stage of the pipeline dramatically shrinks the amount of data (e.g. grep-ing error logs from all log files), so that subsequent stages can easily handle the rest of the data with far fewer executors.

(Disclosure: I'm a Spark committer.)
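The parallelism cost can be illustrated with a rough scheduling model in plain Python. It reuses the hypothetical 20-node, 10-executors-per-node cluster from the original question and assumes one task slot per executor and equally sized tasks; `stage_profile` is an illustrative helper, not a Spark API, and real scheduling is more involved than this.

```python
import math


def stage_profile(num_tasks, slots):
    """Return (waves, busy_fraction) for num_tasks equal tasks on `slots` parallel slots."""
    waves = math.ceil(num_tasks / slots)      # rounds of scheduling needed
    busy_fraction = num_tasks / (waves * slots)  # share of slot-time doing work
    return waves, busy_fraction


# Assumption: 20 nodes x 10 executors, one task slot per executor.
SLOTS = 20 * 10

for tasks in (20000, 200, 20):
    waves, busy = stage_profile(tasks, SLOTS)
    print(f"{tasks:>6} tasks: {waves:>3} wave(s), {busy:.0%} of slots busy")
```

In this model, coalescing to 200 partitions still keeps every slot busy, but coalescing to 20 leaves 90% of the cluster idle while each remaining task handles ten times more data. That is the loss of parallelism described above.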


placeybordeaux 3578 days ago

Yeah, I didn't quite get that part either.
