Hacker News
by vvladymyrov 2620 days ago
I saw the announcement about .NET interop support in Apache Spark some time ago. The benchmarks are interesting and tell the story: in a few cases it is faster than Python, but slower than Scala/JVM, which is native for Spark. Maybe with Arrow interchange Python's performance would increase (and likewise for other interops that would use Arrow, i.e. for .NET).

But performance is not the only thing: there is also the ability to debug issues. For this you still need to dig into the Apache Spark core, which is written in Scala.

This implementation in .NET would be a "gateway drug" for moving your production to Scala/JVM.

It happened to me with PySpark: the majority of tasks at hand can be solved with PySpark, but digging into the issues and stack traces brought me to the Scala internals of Apache Spark. As a result, in cases when Python-specific libraries are not needed and high performance is needed, I would write Spark programs in Scala from the beginning.

2 comments

Same, I recently moved an ML pipeline from PySpark to pure Python because of the debuggability issue. The data science team, who managed the project, were experts in Python but relatively weak in Scala/Java. There were many issues where an improper data type would blow up pickling on the Java side and return absolutely cryptic errors. It was also difficult to do any sort of integration testing and profiling on the code - the move off of Spark originally started as a way to do integration testing and profiling.
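To illustrate the data type problem described above, here is a minimal sketch of checking records in plain Python before they are handed to Spark, so that a bad type surfaces as a readable Python message rather than a JVM-side pickling error. The schema `EXPECTED_TYPES` and the helper `find_type_errors` are hypothetical names invented for this example, not part of PySpark or of the pipeline mentioned in the comment.

```python
# Hypothetical schema for illustration only: field name -> expected Python type.
EXPECTED_TYPES = {"user_id": int, "score": float, "label": str}


def find_type_errors(record, expected=EXPECTED_TYPES):
    """Return readable messages for fields that are missing or mistyped."""
    errors = []
    for field, expected_type in expected.items():
        if field not in record:
            errors.append(f"{field}: missing")
            continue
        value = record[field]
        # bool is a subclass of int, so reject it unless bool is what we expect.
        is_stray_bool = isinstance(value, bool) and expected_type is not bool
        if is_stray_bool or not isinstance(value, expected_type):
            errors.append(
                f"{field}: expected {expected_type.__name__}, "
                f"got {type(value).__name__}"
            )
    return errors


if __name__ == "__main__":
    for message in find_type_errors({"user_id": "1", "score": 0.5}):
        print(message)
```

A check like this could run in an ordinary unit test or just before building a DataFrame; whether it would have caught the specific errors mentioned above is an assumption.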
Indeed the real sad part is you can’t lead teams there early (premature optimization). Everybody seems to make the same rough transition on their own.
On a somewhat related note, for my purposes the real deciding factor in sticking with Scala/JVM for (production) Spark work is testability: With that setup, it's dead easy to fire up a local Spark context, run unit tests against it, and keep the tests running reasonably fast.
For Python folks: pyspark is now also installable from pip as a package (it includes some .jars, so it is a ~100 MB Python package). So my team installs pyspark as a package for local unit tests.
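A rough sketch of what such a local unit test can look like once pyspark has been installed with pip (a Java runtime is also assumed to be available). The function `normalize_name`, the helper `make_local_session`, and the test class are hypothetical names made up for this example; the Spark test is skipped when pyspark is not installed.

```python
import importlib.util
import unittest


def normalize_name(name):
    """Plain-Python row logic; testable with or without Spark."""
    return name.strip().lower()


def make_local_session():
    # Imported lazily so this module still imports without pyspark installed.
    from pyspark.sql import SparkSession

    return (
        SparkSession.builder
        .master("local[2]")  # run Spark inside this process on two threads
        .appName("local-unit-tests")
        .getOrCreate()
    )


@unittest.skipUnless(importlib.util.find_spec("pyspark"), "pyspark is not installed")
class NormalizeNameSparkTest(unittest.TestCase):
    @classmethod
    def setUpClass(cls):
        # One session for the whole class keeps the tests reasonably fast.
        cls.spark = make_local_session()

    @classmethod
    def tearDownClass(cls):
        cls.spark.stop()

    def test_udf_normalizes_names(self):
        from pyspark.sql.functions import udf
        from pyspark.sql.types import StringType

        df = self.spark.createDataFrame([("  Ada ",), ("BOB",)], ["name"])
        normalize = udf(normalize_name, StringType())
        rows = df.withColumn("clean", normalize("name")).collect()
        self.assertEqual(sorted(r["clean"] for r in rows), ["ada", "bob"])


if __name__ == "__main__":
    unittest.main()
```

Keeping the row-level logic in an ordinary function means it can also be tested and profiled without starting Spark at all.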
These are all good points. Debuggability and general support for the development lifecycle are important. We are definitely working on providing first-class development experiences for .NET developers. .NET for Apache Spark is already available as a NuGet package for local install. We are currently working on adding support for VS Code, Visual Studio, etc. Feel free to tell us your preferred dev platform. [Disclaimer: I am Program Manager for the .NET for Apache Spark effort]
Thanks for the response!

FWIW, I was speaking specifically to being able to run Spark, and manage its lifecycle, all inside the same process as the unit test code. Which is something that I'll openly concede isn't much more than a fun party trick for most people's purposes, but it does happen to serve me well.

In a past life, I was involved in data engineering at a .NET shop, and being able to migrate parts of our process to something like Spark without having to rewrite or otherwise severely damage it would have made me very happy. Even better if I could stay inside Visual Studio, and bang on it from an F# interactive session.

Wild speculation, but if you can produce type providers that know how to tame `DataSet[Row]`, you might have some nonzero number of F# hipsters like me kissing your feet.

(Or not. Like I said, my perspective on Spark is unusual.)

Thanks... More idiomatic F# support is on the roadmap