| HN Mirror

Y	Hacker News new \| ask \| show \| jobs

by glapark 2232 days ago

Shuffling in Spark works well for small datasets, but is not reliable for large datasets because fault tolerance in Spark is incomplete. For example, check this Jira:

https://issues.apache.org/jira/browse/SPARK-20178

So, if your problem was mainly due to shuffle-heavy workload, then I guess no managed Spark service would be able to alleviate/eliminate it by automatic parameter tuning. In other words, your pain might be due to a fundamental problem in Spark itself.

IMO, Spark is great, but its speed is no longer its key strength. For examples, Hive is much faster than SparkSQL these days.

2 comments

jamesblonde 2231 days ago

It's worse than that. Shuffle for Spark on Kubernetes is fundamentally broken and hasn't yet been fixed. The problem is that Docker containers cannot (for security reasons) share the same host-level disks. There is no external shuffle service, and disk-caching is container-local (not using kernel-level disk I/O buffering) which kills performance. Google's proposed soln below is to use NFS to store shuffle files, which is not going to be performant. Stick with YARN for Spark and only switch when shuffle is fixed for k8s. Databricks are in no rush to get shuffle fixed for k8s.

References: https://youtu.be/GbpMOaSlMJ4?t=1617 https://t.co/KWDNHjudfY?amp=1 https://issues.apache.org/jira/browse/SPARK-25299

link

epdlxjmonad 2231 days ago

I agree that Spark on Kubernetes will have a hard time fixing the problem of shuffling. If they choose to use local disks for per-node shuffle service, a performance issue arises because disk-caching is container-local. If they choose to use NFS to store shuffle files, a different kind of performance issue arises because of not using local disks for storing intermediate files. All these issues will arise without properly implementing fault tolerance in Spark.

We are currently trying to fix the first problem in a different context (not Spark), where worker containers store intermediate shuffle files in local disks mounted as hostPath volumes. The performance penalty is about 50% compared with running everything natively. Besides occasionally some containers almost get stuck for a long time. I believe that the Spark community will encounter the same problem in the future if they choose to use local disks for storing intermediate files.

link

jstephan 2229 days ago

Glad our post sparked some pretty deep discussions on the future of spark-on-k8s ! The OS community is working on several projects to help this problem. You've mentioned NFS (by Google) but there's also the possibility to use object storage. Mappers would first write to local disks, and then the shuffle data would be async moved to the cloud.

Sources: - end of presentation https://www.slideshare.net/databricks/reliable-performance-a... - https://issues.apache.org/jira/browse/SPARK-25299

link

ramraj07 2232 days ago

I completely moved away from spark into snowflake due to this reason. It's failure modes seem to become far more predictable and you learn to become significantly more productive with it even though it's all pure SQL

link