Hacker News new | ask | show | jobs
by liancheng 3571 days ago
Coalescing down to a smaller partition number does decrease the number of output files. But it also decreases parallelism, which isn't expected when processing so large a dataset.

Coalescing makes more sense when some stage of the pipeline dramatically shrinks the amount of data (e.g. grep-ing error logs from all log files) so that successive stages can easily handle the rest of the data with much fewer executors.

(disc.: Spark committer)