| HN Mirror

Y	Hacker News new \| ask \| show \| jobs


	by kenhwang 2620 days ago
	What's probably more interesting is how similar .net, scala, and python are in query performance. Not sure if that can be attributed to great python performance, or really bad scala/.net performance.

3 comments

aeroevan 2620 days ago

Most of PySpark is simply telling the JVM what to do, it's not actually running python directly. UDFs are where the real differences are, and they mentioned CLR UDFs serialize the spark Rows 2x faster than Python, but it's not clear if they were using apache arrow enabled pandas UDFs which are 3x-100x faster:

https://databricks.com/blog/2017/10/30/introducing-vectorize...

link

imbac82 2620 days ago

https://devblogs.microsoft.com/dotnet/introducing-net-for-ap...

link

aeroevan 2620 days ago

Python 2.7? Also, the new Apache Arrow integration changes the python performance characteristics a lot; I wonder if they are using arrow for their JVM <-> CLR interop, if not that probably would be a good idea.

link

polskibus 2620 days ago

Your post made me curious and I raised the issue with MS at https://github.com/dotnet/spark/issues/45. I hope that benefits the community and gets MS on the right track (by finally supporting Arrow).

link

polskibus 2620 days ago

It would be interesting to hear about it from MS. Do you know of other settings / configurations / features that could greatly influence the result of such comparison?

link

MichaelRys 2620 days ago

Since we started replying to these points on the Github thread at https://github.com/dotnet/spark/issues/45, I am suggesting to continue the discussion there. As mentioned there, we want to be transparent with the benchmark code and systems we use. We are currently working on arrow support to compare fairly.

link

bunderbunder 2619 days ago

A large percentage of Spark code is really just assembling lego blocks. The built-in blocks are themselves all written in Java or Scala, and the performance of the code that stacks them together is negligible.

It's mainly when you start writing custom UDFs (IOW, fabricating your own lego blocks) that platform interop and the performance of your language of choice become a big deal.

link