| HN Mirror


by liancheng 3618 days ago

Coalescing down to a smaller number of partitions does reduce the number of output files. But it also reduces parallelism, which is not what you want when processing a dataset this large.

Coalescing makes more sense when some stage of the pipeline dramatically shrinks the amount of data (e.g. grepping error lines out of all the log files), so that subsequent stages can easily handle what remains with far fewer executors.
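
To make the trade-off concrete, here is a rough pure-Python toy model, not Spark code. The names `coalesce`, `max_parallel_tasks`, and the sample `logs` data are all hypothetical stand-ins; the sketch only assumes that one partition maps to one task and one output file, and that coalescing merges whole partitions without reshuffling individual records.

```python
from typing import List

Partition = List[str]


def coalesce(partitions: List[Partition], num_partitions: int) -> List[Partition]:
    """Toy stand-in for a shuffle-free coalesce: merge neighbouring
    partitions into at most num_partitions groups."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    if num_partitions >= len(partitions):
        # Without a shuffle the partition count can only go down.
        return [list(p) for p in partitions]
    merged: List[Partition] = [[] for _ in range(num_partitions)]
    for i, part in enumerate(partitions):
        # Map contiguous runs of input partitions onto one output partition.
        merged[i * num_partitions // len(partitions)].extend(part)
    return merged


def max_parallel_tasks(partitions: List[Partition], cores: int) -> int:
    """One task per partition, so parallelism is capped by both numbers."""
    return min(len(partitions), cores)


# Hypothetical input: 8 partitions of log lines, each with a single ERROR line.
logs = [
    [f"INFO p{p} line{i}" for i in range(99)] + [f"ERROR p{p} disk full"]
    for p in range(8)
]

# Coalescing the full dataset early caps the heavy work at 2 tasks.
early = coalesce(logs, 2)

# Filtering first keeps all 8 tasks busy on the big data, then the tiny
# result is coalesced so it ends up in only 2 output files.
errors = [[line for line in part if line.startswith("ERROR")] for part in logs]
compact = coalesce(errors, 2)
```

In this model, `max_parallel_tasks(logs, 16)` is 8 while `max_parallel_tasks(early, 16)` is 2, even though 16 cores are available; the filtered `compact` result holds only 8 lines in total, so running it on 2 partitions costs next to nothing.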

(Disclosure: I'm a Spark committer.)