by glapark
2232 days ago
Shuffling in Spark works well for small datasets, but it is not reliable for large datasets because fault tolerance in Spark is incomplete. For example, see https://issues.apache.org/jira/browse/SPARK-20178 in Jira. So, if your problem was mainly due to a shuffle-heavy workload, then I guess no managed Spark service would be able to alleviate or eliminate it through automatic parameter tuning. In other words, your pain might be due to a fundamental problem in Spark itself. IMO, Spark is great, but speed is no longer its key strength. For example, Hive is much faster than SparkSQL these days.
References: https://youtu.be/GbpMOaSlMJ4?t=1617 https://t.co/KWDNHjudfY?amp=1 https://issues.apache.org/jira/browse/SPARK-25299