What's probably more interesting is how similar .NET, Scala, and Python are in query performance. I'm not sure whether that can be attributed to great Python performance or to really bad Scala/.NET performance.
Most of PySpark simply tells the JVM what to do; it isn't actually running Python directly. UDFs are where the real differences are. They mentioned that CLR UDFs serialize Spark Rows 2x faster than Python, but it's not clear whether they were comparing against Apache Arrow-enabled pandas UDFs, which are reportedly 3x-100x faster than regular row-at-a-time Python UDFs (see the Databricks link in this thread).
Python 2.7? Also, the new Apache Arrow integration changes the Python performance characteristics a lot. I wonder if they are using Arrow for their JVM <-> CLR interop; if not, that would probably be a good idea.
Your post made me curious and I raised the issue with MS at https://github.com/dotnet/spark/issues/45. I hope that benefits the community and gets MS on the right track (by finally supporting Arrow).
It would be interesting to hear about it from MS. Do you know of other settings, configurations, or features that could greatly influence the results of such a comparison?
Since we started replying to these points on the GitHub thread at https://github.com/dotnet/spark/issues/45, I suggest we continue the discussion there. As mentioned there, we want to be transparent about the benchmark code and systems we use. We are currently working on Arrow support so that the comparison is fair.
A large percentage of Spark code is really just assembling Lego blocks. The built-in blocks are themselves all written in Java or Scala, and the overhead of the code that stacks them together is negligible.
It's mainly when you start writing custom UDFs (in other words, fabricating your own Lego blocks) that platform interop and the performance of your language of choice become a big deal.
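To make the UDF point a bit more concrete, here's a toy sketch in plain Python (stdlib only, no Spark involved). It is not how Spark or Arrow actually move data; `pickle` just stands in for a row-at-a-time serializer and `array` for a columnar batch, and the names `plus_one`, `row_at_a_time`, and `batched` are made up for illustration. The idea is only that crossing the runtime boundary once per row costs far more round trips than crossing it once per batch.

```python
import pickle
from array import array


def plus_one(x):
    # The "UDF": trivial on purpose, so the boundary crossings dominate.
    return x + 1.0


def row_at_a_time(values):
    """Send every value across the pretend boundary as its own pickled message."""
    inbound_messages = 0
    out = []
    for v in values:
        payload = pickle.dumps(v)  # one serialization per row going in
        inbound_messages += 1
        result = plus_one(pickle.loads(payload))
        out.append(pickle.loads(pickle.dumps(result)))  # and one coming back
    return out, inbound_messages


def batched(values):
    """Send the whole column as one contiguous buffer (a stand-in for a columnar batch)."""
    payload = array("d", values).tobytes()  # one serialization for the whole batch
    inbound_messages = 1
    column = array("d")
    column.frombytes(payload)
    result = array("d", (plus_one(v) for v in column))
    out = array("d")
    out.frombytes(result.tobytes())  # one buffer coming back
    return list(out), inbound_messages


values = [float(i) for i in range(1000)]
rows_out, rows_msgs = row_at_a_time(values)
batch_out, batch_msgs = batched(values)
```

Both paths compute the same thing; the difference is 1000 inbound messages versus 1. If you wanted actual numbers you could wrap the two calls in `timeit`, but I wouldn't read much into timings from a toy like this.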
https://databricks.com/blog/2017/10/30/introducing-vectorize...