|
|
|
|
|
by nivekkevin
1270 days ago
|
|
One significant disadvantage of PySpark is its reliance on py4j to serialize and deserialize objects between Java and Python when using Python UDFs. This constant overhead can become burdensome as data volume increases in such an exchange. However, I am glad to see efforts to create a data pipeline framework using Python and Ray. ~One suggestion, a Scala/Java Spark run of those benchmarks should be a valid baseline to compare against as well instead of PySpark.~
Ah it's SparkSQL so the execution probably wouldn't have much of py4j involvement, except for the collect. |
|
https://spark.apache.org/docs/3.0.0/sql-pyspark-pandas-with-...