Hacker News new | ask | show | jobs
by threeseed 406 days ago
You should try using an S3 based shuffle plugin: https://github.com/IBM/spark-s3-shuffle

Then mount FSX for Lustre on all of your EMR nodes and have it write shuffle data there. It will massively improve performance and shuffle issues will disappear.

Is expensive though. But you can offset the cost now because you can run entirely Spot instances for your workers as if you lose a node there's no recomputation of the shuffle data.