Hacker News
by aeroevan 2620 days ago
Most of PySpark is simply telling the JVM what to do; it's not actually running Python directly. UDFs are where the real differences are. They mentioned that CLR UDFs serialize Spark Rows 2x faster than Python, but it's not clear whether they were comparing against Apache Arrow-enabled pandas UDFs, which are reported to be 3x-100x faster than row-at-a-time Python UDFs:

https://databricks.com/blog/2017/10/30/introducing-vectorize...
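
To make the distinction concrete, here's a rough sketch of the two UDF styles side by side. It assumes PySpark 3.x with pyarrow installed; the names plus_one_row, plus_one_batch and run_on_spark are made up for illustration, and this is not the benchmark from the linked post.

```python
import pandas as pd


def plus_one_row(v: float) -> float:
    # Row-at-a-time style: Spark calls this once per value.
    return v + 1.0


def plus_one_batch(v: pd.Series) -> pd.Series:
    # Vectorized style: Spark calls this once per Arrow batch,
    # passing a whole column chunk as a pandas Series.
    return v + 1.0


def run_on_spark() -> tuple:
    # Imported lazily so the two functions above can be tried without Spark.
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import col, pandas_udf, udf

    spark = SparkSession.builder.master("local[1]").getOrCreate()
    df = spark.range(3).select(col("id").cast("double").alias("x"))

    # Same logic, wrapped two different ways.
    row_udf = udf(plus_one_row, "double")
    batch_udf = pandas_udf(plus_one_batch, "double")

    # Run each style in its own query and collect the results.
    row_result = [r[0] for r in df.select(row_udf("x")).collect()]
    batch_result = [r[0] for r in df.select(batch_udf("x")).collect()]

    spark.stop()
    return row_result, batch_result
```

Calling run_on_spark() should give the same values from both paths; the difference is only in how the data crosses the JVM/Python boundary, which is the part any serialization comparison would need to pin down.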