Hacker News
by ap22213 3578 days ago
"Remove the two temporary tables and combine all three Hive stages into a single Spark job that reads 60 TB of compressed data and performs a 90 TB shuffle and sort."

"As far as we know, this is the largest real-world Spark job attempted in terms of shuffle data size"

I'm far, far from a world-class engineer, but I regularly do 90 TiB shuffle sorts. I must seriously be missing something here.

2 comments

Have you run into any of the issues mentioned in the article? Some of them are regressions; which version of Spark were you running?

Of the linked issues, these all seem like they would be "easy" to hit given enough data:

https://issues.apache.org/jira/browse/SPARK-13279

https://issues.apache.org/jira/browse/SPARK-13850

https://issues.apache.org/jira/browse/SPARK-13958

https://issues.apache.org/jira/browse/SPARK-14363

Are you using Spark? That's the context.
Yes