by diogofranco
2988 days ago
PySpark might be the go-to choice for data scientists playing with the Spark REPL or MLlib, but for production data engineering, Scala is still king. Besides performance, and the obvious fact that not knowing Scala makes it difficult to understand the underlying Spark code, Scala is more natural to develop in for several reasons (many libraries are Scala-only, for example).

I have no doubt that Scala is more performant, and the "fat" JAR mechanism makes dependency management and code shipping very easy (it's still tricky to install Python dependencies on your Spark nodes), but the pandas ecosystem is definitely more intuitive to understand.