Hacker News
by nivekkevin 1270 days ago
One significant disadvantage of PySpark is its reliance on py4j to serialize and deserialize objects between Java and Python when using Python UDFs. That overhead becomes burdensome as the volume of data crossing the Java/Python boundary grows. However, I am glad to see efforts to create a data pipeline framework using Python and Ray.

~One suggestion: a Scala/Java Spark run of those benchmarks would be a valid baseline to compare against as well, instead of PySpark.~ Ah, it's Spark SQL, so the execution probably wouldn't involve py4j much, except for the collect.

2 comments

There are also pandas UDFs, which use Arrow as the exchange format. I assume it still has to copy the data (?), but it makes the (de)serialization fast and allows for vectorized operations.

https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-...
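
To make the batching point concrete, here is a toy sketch, not Spark code. It assumes `pickle` as a stand-in for whatever serializer sits at the boundary (Arrow's columnar format works differently), and the helper names `per_row_exchange` and `batched_exchange` are made up for illustration. It only counts how many serialize/deserialize round trips each approach needs for the same data.

```python
import pickle


def per_row_exchange(rows, func):
    """Serialize each row on its own, roughly like a row-at-a-time UDF."""
    calls = 0
    out = []
    for row in rows:
        payload = pickle.dumps(row)    # one serialize per row
        value = pickle.loads(payload)  # one deserialize per row
        out.append(func(value))
        calls += 1
    return out, calls


def batched_exchange(rows, func, batch_size=1000):
    """Serialize rows in chunks, so the fixed per-call cost is paid per batch."""
    calls = 0
    out = []
    for start in range(0, len(rows), batch_size):
        payload = pickle.dumps(rows[start:start + batch_size])
        batch = pickle.loads(payload)  # still a copy, but one per batch
        out.extend(func(v) for v in batch)
        calls += 1
    return out, calls


rows = list(range(10_000))
row_result, row_calls = per_row_exchange(rows, lambda x: x + 1)
batch_result, batch_calls = batched_exchange(rows, lambda x: x + 1)

# Same results, very different number of round trips.
print(row_calls, batch_calls)  # 10000 10
```

Both versions still copy the data. The batched one just pays the fixed per-call cost far fewer times, which is the same general idea behind exchanging whole batches instead of single rows.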

None of the benchmarks involved any UDFs.