by bipin_nag 3853 days ago
Correct me if I am wrong but Spark does not use JNI for Python. It uses py4j to allow Python programs to access Java Objects. All the core functionality is implemented in Scala/Java. py4j uses sockets to communicate. See https://www.py4j.org/about.html

That's my understanding as well. This is why you are caught between a rock and a hard place on the Hadoop stack. Java code won't be efficient in terms of FLOPs, and the popular and often effective escape hatch, coding the number-crunching parts in C, C++ or Fortran, is not very effective or convenient either. This is particularly so if it requires going back and forth over the bridge frequently, because crossing the bridge has significant overhead. So typically you have to move the majority of the core into the high-performance language of your choice, and what remains is glue. If gluing is what I want to do, there are other languages that can give Java stiff competition. The core capability going for Java in this domain is HDFS, and it's not that great a file system for big data.
When we say JVM code is not efficient in terms of FLOPs we're talking about a factor of 2 or so though, not really the same as the factor of 20+ you lose by implementing numeric code in pure Python. So Scala gives you a 10x faster cycle during early development. Maybe it's fast enough that you don't need that last factor of 2, or maybe just JNIing a few core operations is enough (you can also do some tricks with the advanced type system; with NumPy if it looks like Python iteration then it is, whereas with something like Breeze you can potentially have something that looks like Scala iteration but won't actually copy the data back and forth). But in the cases where you do need to push things right down into Fortran, Scala is just as good a glue language as Python is.
> When we say JVM code is not efficient in terms of FLOPs we're talking about a factor of 2 or so though, not really the same as the factor of 20+

Not so in my experience. Write a matrix multiply in pure Java and in pure C, C++ or Fortran. The difference will easily be north of 5x, typically more.
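
For anyone who wants to try this themselves, here is a minimal sketch of the Java half of such a comparison. The class name, matrix size, and run counts are my own placeholders, not anything from this thread, and the C/C++/Fortran counterpart is not shown. It is a plain triple loop with a crude warm-up so the JIT gets a chance to compile the hot loop before timing; it does not by itself confirm or refute any particular speed ratio.

```java
import java.util.Random;

public class MatMulBench {
    // Plain triple loop over row-major n x n matrices, computing c = a * b.
    // The i-k-j loop order keeps the inner loop walking b and c contiguously.
    static double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int i = 0; i < n; i++) {
            for (int k = 0; k < n; k++) {
                double aik = a[i * n + k];
                for (int j = 0; j < n; j++) {
                    c[i * n + j] += aik * b[k * n + j];
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        int n = 512;      // hypothetical problem size
        int warmups = 3;  // untimed runs so the JIT can compile the hot loop
        int runs = 5;     // timed runs; the best one is reported

        Random rng = new Random(42);
        double[] a = new double[n * n];
        double[] b = new double[n * n];
        for (int i = 0; i < n * n; i++) {
            a[i] = rng.nextDouble();
            b[i] = rng.nextDouble();
        }

        double checksum = 0.0; // printed below so the work cannot be optimized away
        for (int r = 0; r < warmups; r++) {
            checksum += multiply(a, b, n)[0];
        }

        long best = Long.MAX_VALUE;
        for (int r = 0; r < runs; r++) {
            long start = System.nanoTime();
            double[] c = multiply(a, b, n);
            long elapsed = System.nanoTime() - start;
            best = Math.min(best, elapsed);
            checksum += c[0];
        }

        // 2 * n^3 floating-point operations; flops per nanosecond equals GFLOP/s.
        double gflops = 2.0 * n * n * n / best;
        System.out.printf("n=%d best=%.1f ms, %.2f GFLOP/s (checksum %.3f)%n",
                n, best / 1e6, gflops, checksum);
    }
}
```

To make the comparison fair, the native version should use the same loop order, the same matrix size, and the same row-major layout. A proper harness such as JMH would give more trustworthy numbers than this hand-rolled timing loop.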

Regarding JNI, please see the root of the comment tree. Rarely have I seen more stinky garbage. It's hopeless if you have to go back and forth across the bridge. If that is so, I might as well be on the other side of the bridge.

The JVM is fantastic, but if you are doing number crunching and performance matters, then the JVM is the wrong choice.