|
|
|
|
|
by BorisTheBrave
1955 days ago
|
|
I'm having a hard time understanding this article. It seems to be a bit too low level on the specifics of Beam for general consumption. From what i undestand, Spark has the same feature built in. If the planner knows that the source data is partitioned and/or sorted appropriately, it can skip shuffling/sorting it, instead having each executor directly requesting the one file it needs. It's a nice optimization, but it's not game changing. You often end up having to shuffle anyway, as you are joining on a different key, or for performance reason you need more executors than the set amount of partitions, or the shuffle needed to write the data doesn't justify the savings on the readers. Maybe it's better with their additional optimizations? Spark does not do those, mostly. |
|