|
|
|
|
|
by wesm
2147 days ago
|
|
What you've written sounds like a criticism of the JVM data analytics ecosystem (the Java Parquet library in particular) and not Apache Arrow itself. Parquet for Java is an independent open source project and developer community. For example, you said > It's barely useable and the dependencies are horrific - the whole thing is mingled with hadoop dependencies - even the API itself. These are comments about http://github.com/apache/parquet-mr which is a different open source project. For C++ / Python / R many of the developers for both Apache Arrow and Apache Parquet are the same and we currently develop the Parquet codebase out of the Arrow source tree. So, I'm not sure what to tell you, we Arrow developers cannot take it upon ourselves to fix up the whole JVM data ecosystem. |
|
But at your landing page, it's claimed "Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs. " and that "Libraries are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust.". This certainly gave me the impression that more than just Python, C++ and R would be well supported.
The JVM isn't complete irrelevant in data-science given the position of Spark/Scala. This also raised my expectations of arrow/parquet because it seems to be the de-facto standard for table storage for this JVM platform. And I experienced no issues on that platform.
To be clear, I'm not blaming you for my design decision (I'm a software engineer not a data-scientist btw), and I still think parquet/arrow rocks for Python but in my experience it doesn't really deliver a useable "cross-language" file format at the moment.