by ap22213
3578 days ago
"Remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort."

"As far as we know, this is the largest real-world Spark job attempted in terms of shuffle data size"

I'm far, far from a world-class engineer, but I regularly do 90 TiB shuffle sorts. I must seriously be missing something here.
|
Out of the linked issues, these all seem like they would be "easy" to hit given enough data:
https://issues.apache.org/jira/browse/SPARK-13279
https://issues.apache.org/jira/browse/SPARK-13850
https://issues.apache.org/jira/browse/SPARK-13958
https://issues.apache.org/jira/browse/SPARK-14363