|
|
|
|
|
by epdlxjmonad
2234 days ago
|
|
I agree that Spark on Kubernetes will have a hard time fixing the problem of shuffling. If they choose to use local disks for per-node shuffle service, a performance issue arises because disk-caching is container-local. If they choose to use NFS to store shuffle files, a different kind of performance issue arises because of not using local disks for storing intermediate files. All these issues will arise without properly implementing fault tolerance in Spark. We are currently trying to fix the first problem in a different context (not Spark), where worker containers store intermediate shuffle files in local disks mounted as hostPath volumes. The performance penalty is about 50% compared with running everything natively. Besides occasionally some containers almost get stuck for a long time. I believe that the Spark community will encounter the same problem in the future if they choose to use local disks for storing intermediate files. |
|
Sources: - end of presentation https://www.slideshare.net/databricks/reliable-performance-a... - https://issues.apache.org/jira/browse/SPARK-25299