by BorisTheBrave 1955 days ago
I'm having a hard time understanding this article. It seems to be a bit too low level on the specifics of Beam for general consumption.

From what I understand, Spark has the same feature built in. If the planner knows that the source data is partitioned and/or sorted appropriately, it can skip shuffling/sorting it and instead have each executor directly request the one file it needs.
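
To make the idea concrete, here is a minimal plain-Python sketch of a join over pre-bucketed, pre-sorted data. It is not Spark's or Beam's actual implementation. The names `NUM_BUCKETS`, `bucket_of`, `write_bucketed`, `merge_join` and `bucketed_join` are made up for illustration, and I'm assuming both datasets were written with the same bucket count and the same key-to-bucket function.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # hypothetical; both datasets must be written with the same count


def bucket_of(key: int) -> int:
    # Deterministic bucket assignment shared by both writers (assumption).
    return key % NUM_BUCKETS


def write_bucketed(rows):
    """Simulate a writer: group (key, value) rows by bucket, sort each bucket by key."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[bucket_of(key)].append((key, value))
    return {b: sorted(rs, key=lambda r: r[0]) for b, rs in buckets.items()}


def merge_join(left, right):
    """Inner-join two lists already sorted by key, without re-sorting them."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Find the run of rows sharing this key on each side.
            i_end = i
            while i_end < len(left) and left[i_end][0] == lk:
                i_end += 1
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            for _, lv in left[i:i_end]:
                for _, rv in right[j:j_end]:
                    out.append((lk, lv, rv))
            i, j = i_end, j_end
    return out


def bucketed_join(left_buckets, right_buckets):
    """Join bucket b with bucket b only; no row ever moves between buckets."""
    out = []
    for b in range(NUM_BUCKETS):
        out.extend(merge_join(left_buckets.get(b, []), right_buckets.get(b, [])))
    return out


users = write_bucketed([(1, "ann"), (2, "bob"), (5, "eve")])
orders = write_bucketed([(1, "book"), (5, "pen"), (5, "ink"), (7, "cup")])
joined = bucketed_join(users, orders)
```

Each iteration of the loop in `bucketed_join` is independent of the others, which is why, in a real engine, one executor can handle one bucket pair by reading just the matching files.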

It's a nice optimization, but it's not game-changing. You often end up having to shuffle anyway, because you are joining on a different key, or for performance reasons you need more executors than the fixed number of partitions, or the shuffle needed to write the data doesn't justify the savings on the readers.

Maybe it's better with their additional optimizations? Spark does not do those, mostly.

2 comments

Hive has also had this optimization for as long as I can remember. As others have noted, it's not particularly new or novel; it's just not part of the Beam SDK.
50% cost reduction though