| HN Mirror



	by RBerenguel 2983 days ago
	Although we decided to start using Scala specifically because PySpark was not as performant (2.0 was not that long ago), a use case I always keep in mind is aggregation (and, in general, any API that is still experimental or under active work). Python bindings are always the last to become available, because all the groundwork is done in Scala. We have a relatively large-scale process that takes advantage of custom-built aggregation methods on top of groupedDatasets, where we can pack a good deal of logic into the merge and reduce steps of the aggregation. We could replicate this in Python using reducers, but aggregating makes more sense semantically, which makes the code easier to understand. Also, the testing facilities for Spark code under Scala are a bit more advanced than under Python (they are not great, but they are better), even without considering that being statically typed makes a whole class of errors impossible, caught right at the compiler. I very, very rarely think of using PySpark (and I have way more experience with Python than with Scala) when working with Spark. In a kitchen setting, it would be like having to prepare a cake and having to choose between a fork and a whisk. I can get it done with the fork, but I'll do a better and faster job with the whisk.
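
	For readers unfamiliar with the pattern, here is a rough plain-Python sketch of the zero/reduce/merge/finish shape that a typed Spark aggregator has in Scala. It is not Spark code and not the process described above: MeanAggregator, MeanBuffer, aggregate_by_key and the sample partitions are all hypothetical names made up for illustration. The point is only that reduce folds one value into a partition-local buffer, while merge combines buffers built on different partitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanBuffer:
    """Intermediate state carried between steps: running sum and count."""
    total: float = 0.0
    count: int = 0


class MeanAggregator:
    """Hypothetical aggregator with the zero/reduce/merge/finish shape."""

    def zero(self):
        # Empty buffer used before any value has been seen.
        return MeanBuffer()

    def reduce(self, buf, value):
        # Fold one input value into a partition-local buffer.
        return MeanBuffer(buf.total + value, buf.count + 1)

    def merge(self, a, b):
        # Combine two buffers that were built on different partitions.
        return MeanBuffer(a.total + b.total, a.count + b.count)

    def finish(self, buf):
        # Turn the final buffer into the output value.
        return buf.total / buf.count if buf.count else None


def aggregate_by_key(partitions, agg):
    """Simulate a grouped aggregation: reduce per partition, then merge."""
    partials = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = agg.reduce(local.get(key, agg.zero()), value)
        partials.append(local)

    merged = {}
    for local in partials:
        for key, buf in local.items():
            merged[key] = agg.merge(merged[key], buf) if key in merged else buf

    return {key: agg.finish(buf) for key, buf in merged.items()}


# Two made-up "partitions" of (key, value) pairs.
partitions = [
    [("a", 1.0), ("b", 10.0)],
    [("a", 3.0), ("b", 20.0), ("a", 5.0)],
]
result = aggregate_by_key(partitions, MeanAggregator())
print(result)
```

	Any logic that has to see both the per-record step and the cross-partition step fits naturally into reduce and merge, which is roughly the semantic advantage over hand-rolled reducers.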

1 comments

sandGorgon 2983 days ago

I will stay away from veering into a statically typed vs dynamically typed conversation here ;)

But I'm very excited about the grouped map UDFs coming in PySpark 2.3. It would be interesting to hear your views on that: https://databricks.com/blog/2017/10/30/introducing-vectorize...
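
As a rough illustration of what "grouped map" means (split the rows by key, apply a function to each whole group, then concatenate the results), here is a plain-Python sketch. It deliberately does not use the PySpark or pandas APIs; grouped_map, subtract_mean and the sample rows are hypothetical names chosen for this example only.

```python
from collections import defaultdict


def grouped_map(rows, key, func):
    """Split rows by `key`, apply `func` to each group, combine the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)

    out = []
    for group_rows in groups.values():
        # func receives a whole group and returns a list of output rows.
        out.extend(func(group_rows))
    return out


def subtract_mean(group_rows):
    """Center the 'v' column within one group."""
    mean = sum(r["v"] for r in group_rows) / len(group_rows)
    return [{**r, "v": r["v"] - mean} for r in group_rows]


# Made-up sample data: two groups identified by "id".
rows = [
    {"id": 1, "v": 1.0},
    {"id": 1, "v": 2.0},
    {"id": 2, "v": 3.0},
    {"id": 2, "v": 5.0},
    {"id": 2, "v": 10.0},
]
centered = grouped_map(rows, "id", subtract_mean)
print(centered)
```

In the real feature the per-group function would receive and return a pandas DataFrame rather than a list of dicts; the sketch only shows the split-apply-combine shape.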


RBerenguel 2983 days ago

I only checked the implementation of the "Arrow UDFs" recently, because I'm interested in the Arrow interaction (out of curiosity), so I still don't have a strong opinion. My main concern is that a lot of the PySpark work goes into figuring out how to interact with, and speed up, the system while still sitting on top of the Scala base.

I'd recommend Dask (I haven't tried it much, but from all I've seen it is top-notch) to anyone who wants Python all the way down (at least until you hit the C at the bottom) ;)


sandGorgon 2983 days ago

Well, we run a hundred-machine cluster on Dataproc for our workloads. Dask is still not battle-tested or cloud-ready (or available), and is generally harder to work with than PySpark.

In general, I will happily stay in the Spark world using PySpark rather than move to Dask right now.
