by wesm 2147 days ago
What you've written sounds like a criticism of the JVM data analytics ecosystem (the Java Parquet library in particular) and not Apache Arrow itself. Parquet for Java is an independent open source project and developer community. For example, you said

> It's barely useable and the dependencies are horrific - the whole thing is mingled with hadoop dependencies - even the API itself.

These are comments about http://github.com/apache/parquet-mr which is a different open source project.

For C++ / Python / R many of the developers for both Apache Arrow and Apache Parquet are the same and we currently develop the Parquet codebase out of the Arrow source tree.

So I'm not sure what to tell you: we Arrow developers cannot take it upon ourselves to fix up the whole JVM data ecosystem.

2 comments

I'm not expecting anything really and I do appreciate your work and effort. And it's a specific use case for arrow, I guess.

But on your landing page, it's claimed that "Apache Arrow defines a language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware like CPUs and GPUs." and that "Libraries are available for C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust." This certainly gave me the impression that more than just Python, C++ and R would be well supported.

The JVM isn't completely irrelevant in data science, given the position of Spark/Scala. This also raised my expectations of arrow/parquet because it seems to be the de facto standard for table storage for this JVM platform. And I experienced no issues on that platform.

To be clear, I'm not blaming you for my design decision (I'm a software engineer, not a data scientist, btw), and I still think parquet/arrow rocks for Python, but in my experience it doesn't really deliver a usable "cross-language" file format at the moment.

Again, I have to object to your use of “arrow/parquet”. These are not the same open source projects and while people use them together it isn’t fair to the developers of each project for you to discuss them like a single project.

FWIW, while the JVM isn't completely irrelevant in data, I will say, even as a big user of Spark via Scala, that JVM languages are quickly becoming irrelevant in data. Spark's Scala API is simultaneously the core of the platform, and also very much a second-class citizen that lacks a lot of important features that the Python API has. Easy interop with a good math library, for example.

Similarly, the reference implementation of Parquet may be in Java, but consuming it from a Java language, outside of a Spark cluster, is still a royal pain. Whereas doing it from Python isn't too bad.

Long story short, I think that expecting a project that's just trying to implement a columnar memory format to also muck out the world's filthiest elephant pen is perhaps asking too much. Though perhaps a project like Arrow could serve as the cornerstone of an effort to douse it all with kerosene and make a fresh start.

I spent a couple of years doing consultancy for life sciences research labs; most people were just using Excel and Tableau, plugged into OLAP and SQL servers, alongside Java and .NET based stores.

Stuff like Arrow doesn't even come onto the radar of IT.

You do raise a very important point. At my organisation, Apache Avro was selected by the Java devs due to the "cross-platform" marketing. However, they found out, after it was too late, that the C/C++ implementations were too buggy/incomplete to effectively interoperate with the Java versions.

Keep in mind that Arrow Java<->C++/Python interop has been in production use in Apache Spark and elsewhere for multiple years now. We have avoided some of the mistakes of past projects by really emphasizing protocol integration tests across the implementations.

I believe that the Spark parquet library is available to be used in plain old Java: https://www.arm64.ca/post/reading-parquet-files-java/

Just for Apache Arrow itself, https://arrow.apache.org/docs/java/ compared to https://arrow.apache.org/docs/js/ or https://arrow.apache.org/docs/cpp/ doesn't look promising in terms of the documentation being usable.

That could be improved without fixing the whole JVM data ecosystem, but that's mostly up to JVM developers. It's unfortunate if the Spark developers using Arrow aren't contributing in this area (especially since many of them are being paid), but it's all open source and undoubtedly pull requests are welcome.

Congratulations on the 1.0 release, it's only going to keep getting better! Really exciting to be able to share data in-memory across languages.