HN Mirror



	by diogofranco 2988 days ago
	PySpark might be the go-to choice for data scientists playing with the Spark REPL or MLlib, but for production data engineering, Scala is still king. Besides performance, and the obvious fact that not knowing Scala makes it difficult to understand the underlying Spark code, Scala is more natural to develop in for several reasons (many libraries are Scala-only, for example).

1 comments

sandGorgon 2988 days ago

I don't think so. Python with data frames is arguably more natural to think and reason about than Scala.

I have no doubt that Scala is more performant, and the "fat" jar mechanism makes dependency management and code shipping very easy (it's still tricky to install Python dependencies on your Spark nodes), but the pandas ecosystem is definitely more intuitive to understand.


RBerenguel 2988 days ago

I have the impression you are thinking of data analytics (pandas, data frames, etc.), whereas I and some other commenters may be thinking more of data-pipelining architectures, where you can't afford wrong typing, the scale is quite large, and you are not even doing the kind of operations pandas data frames are useful for.
